Document different alpaka terms and its relationships #2275
base: develop
Conversation
docs/source/basic/terms.rst (Outdated)

Accelerator
-----------

- An ``accelerator`` is an index mapping function. It distributes the index space into chunks: the ``accelerator`` maps a continuous index space to a blocked index domain decomposition.
It is much more than that.
For example, the OpenMP and TBB accelerators implement different kinds of parallelism for the CPU device.
Yes, I know. The accelerator is pretty hard to explain. Actually, I only wrote down what @psychocoderHPC told me, because my personal description is too imprecise.
I would discuss the documentation of the accelerator in the next alpaka VC. For me, this is a complicated subject.
docs/source/basic/terms.rst (Outdated)

-----

- Stores operations which should be executed on a ``device``.
- Operations can be ``TaskKernels``, ``Events``, ``Sets`` and ``Copies``.
Also host tasks (that are not kernels).
Memory allocations can also be submitted to a queue.
Is a host task also a trait/concept in alpaka? I never heard about this.
Mhm, I'm not sure about a concept.
I think you can submit anything that can be called with no arguments:
#include <chrono>
#include <iostream>
#include <thread>

using namespace std::chrono_literals;

#include <alpaka/alpaka.hpp>

int main(void) {
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
    using Platform = alpaka::PlatformCudaRt;
#else
    using Platform = alpaka::PlatformCpu;
#endif // ALPAKA_ACC_GPU_CUDA_ENABLED
    Platform platform;

    // use the first available device
    using Device = alpaka::Dev<Platform>;
    Device device = alpaka::getDevByIdx(platform, 0u);

    // create an asynchronous queue
    using Queue = alpaka::Queue<Device, alpaka::NonBlocking>;
    Queue queue{device};

    std::cout << "before submitting\n";

    // submit a host task to an asynchronous queue
    alpaka::enqueue(queue, [](){
        // asynchronous host task executed from a device queue
        std::cout << "async hello world from " << alpaka::core::demangled<Device> << '\n';
    });

    // sleep for 1s
    std::this_thread::sleep_for(1000ms);
    std::cout << "after submitting, before waiting\n";

    // wait for the asynchronous task to complete
    alpaka::wait(queue);
    std::cout << "after waiting\n";

    return 0;
}
Compiled with
$ g++ -std=c++17 -O2 -DALPAKA_ACC_GPU_CUDA_ENABLED -DALPAKA_HOST_ONLY -I/home/fwyzard/src/alpaka-group/alpaka/include -I/usr/local/cuda/include test.cc -L/usr/local/cuda/lib64 -lcudart -lcuda -lrt -o test
should print
$ ./test
before submitting
async hello world from alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt>
after submitting, before waiting
after waiting
Compiled with
$ g++ -std=c++17 -O2 -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED -I/home/fwyzard/src/alpaka-group/alpaka/include test.cc -L/usr/local/cuda/lib64 -lrt -o test
should print
$ ./test
before submitting
async hello world from alpaka::DevCpu
after submitting, before waiting
after waiting
I was not aware that this is possible. I used a task kernel and a CPU device for it. But I found the feature in the queue tests.
I will add it. Looks like my documentation makes sense ;-)
I used a task kernel and a CPU device for it.
That also works.
The point of using a host task with a device queue is to run the host code only after the device operations are complete.
A similar functionality in CUDA is provided by cudaLaunchHostFunc, but the alpaka version does not have CUDA's limitation that the host function cannot call other device API functions.
Okay, I see why the functionality makes sense.
Queue
-----
I would add that a queue can be either "synchronous" (blocking) or "asynchronous" (non-blocking), and that this may determine how the work is submitted to the device, and if a synchronisation is performed after each operation.
Force-pushed from e5deeea to e43eb8b
The attached image is great. Don't we need to show a relation from accelerator to device, since in the code we use the accelerator type, not the platform, to get the device?
docs/source/basic/terms.rst (Outdated)

- A ``device`` represents a compute unit, such as a CPU or a GPU.
- Each ``device`` is bound to a specific ``platform``.
- Each ``device`` can have any number of ``queues``.
A device can be attached to data (in an alpaka buf or view), which makes memcpy calls easy.
There is also a relation between accelerator and TaskKernel, which I didn't add. I'm not sure if I should add it in this diagram or in the second one. Same for your suggestion.
docs/source/basic/terms.rst (Outdated)

- Stores operations which should be executed on a ``device``.
- Operations can be ``TaskKernels``, ``HostTasks``, ``Events``, ``Sets`` and ``Copies``.
- Each ``queue`` is bounded to a specific ``device``.
- A Queue can be ``Blocking`` (host thread is waiting for finishing the API call) or ``NonBlocking`` (host thread continues after calling the API independent if the call finished or not).
[nit] Blocking queues only block the caller thread of exec or enqueue; they do not block other queues.
docs/source/basic/terms.rst (Outdated)

- Operations can be ``TaskKernels``, ``HostTasks``, ``Events``, ``Sets`` and ``Copies``.
- Each ``queue`` is bounded to a specific ``device``.
- A Queue can be ``Blocking`` (host thread is waiting for finishing the API call) or ``NonBlocking`` (host thread continues after calling the API independent if the call finished or not).
- All operations in a queue will be executed sequentiell.
Suggested change:
- All operations in a queue will be executed sequentiell.
+ All operations in a queue will be executed sequentially.
docs/source/basic/terms.rst (Outdated)

Set
---

- A ``Set`` set byte wise a memory to a specific value.
Suggested change:
- A ``Set`` set byte wise a memory to a specific value.
+ A ``Set`` sets a memory to a specific value byte-wise.
This is actually the wrong way of using accelerators, platforms and devices.
This is actually one of the motivations for this documentation. Our examples are written in an accelerator-centered way, but this is actually not correct. And the idea that the accelerator is the central element causes some problems if you want to implement new features.
It does make sense to write the unit tests like that: we want to test the accelerator, so we need to use the corresponding device. But in general, it makes more sense for an application to start from the platform, enumerate the available devices, pick one, create a queue, and choose the accelerator that best matches that device and the problem at hand.
@alpaka-group/alpaka-maintainers I'm not able to finish the PR because I do not really understand the alpaka concepts. Somebody can take over the PR. If the PR is not finished by the next release, it will be closed without merging.
Force-pushed from 55ca596 to 0d05440
@alpaka-group/alpaka-maintainers Since @SimeonEhrig asked for help, I changed some points which IMO are important.
docs/source/basic/terms.rst (Outdated)

Accelerator
-----------

- Describes "how" a kernel work division is mapped to device threads.
We give the work division separately to the kernel; we don't use the workdiv from the acc?
The user gives the work division to the kernel exec call, but in the end this is only the interface. I do not like describing the interface in words; my intention is to describe what the different concepts do.
In this case the accelerator performs the mapping from indices to physical threads.
docs/source/basic/terms.rst (Outdated)

Work division
-------------

- Describes the domain decomposition of a contiguous N-dimensional index domain in ``blocks``, ``threads`` within a ``block``, and ``elements`` per ``thread``.
Suggested change:
- Describes the domain decomposition of a contiguous N-dimensional index domain in ``blocks``, ``threads`` within a ``block``, and ``elements`` per ``thread``.
+ Describes the domain decomposition of a contiguous N-dimensional index domain in ``blocks``, ``threads`` and ``elements``. A ``block`` contains one or more ``threads`` and a ``thread`` one or more ``elements``.
fixed
docs/source/basic/terms.rst (Outdated)

- N-dimensional work divisions (1D, 2D, 3D) are supported.
- Holds implementations of shared memory, atomic operations, math operations etc.
- ``Accelerators`` are instantiated only when a kernel is executed, and can only be accessed in device code.
- Each device function can (should) be templated on the accelerator type, and take an accelerator as its first argument.
Looks like the indentation is not correct. It does not render as sub items.
fixed
Force-pushed from 0d05440 to 2c53167
@alpaka-group/alpaka-maintainers ping
@psychocoderHPC I can only approve it via text because I'm the author of the PR.
Work division
-------------

* Describes the domain decomposition of a contiguous N-dimensional index domain in ``blocks``, ``threads`` and ``elements``. A ``block`` contains one or more ``threads`` and a ``thread`` process one or more ``elements``.
Suggested change:
- * Describes the domain decomposition of a contiguous N-dimensional index domain in ``blocks``, ``threads`` and ``elements``. A ``block`` contains one or more ``threads`` and a ``thread`` process one or more ``elements``.
+ * Describes the domain decomposition of a contiguous N-dimensional index domain in ``blocks``, ``threads`` and ``elements``. A ``block`` contains one or more ``threads`` and each ``thread`` processes one or more ``elements``.
* Stores tasks which should be executed on a ``device``.
* Operations can be ``TaskKernels``, ``HostTasks``, ``Events``, ``Sets`` and ``Copies``.
* A Queue can be ``Blocking`` (host thread is waiting for finishing the API call) or ``NonBlocking`` (host thread continues after calling the API independent if the call finished or not).
Suggested change:
- * A Queue can be ``Blocking`` (host thread is waiting for finishing the API call) or ``NonBlocking`` (host thread continues after calling the API independent if the call finished or not).
+ * A Queue can be ``Blocking`` (the host thread waits for each API call or task to finish) or ``NonBlocking`` (the host thread continues after making an API call or submitting a task without waiting for it to finish, or even start).
* Stores tasks which should be executed on a ``device``.
* Operations can be ``TaskKernels``, ``HostTasks``, ``Events``, ``Sets`` and ``Copies``.
* A Queue can be ``Blocking`` (host thread is waiting for finishing the API call) or ``NonBlocking`` (host thread continues after calling the API independent if the call finished or not).
* All operations in a queue will be executed sequential in FIFO order.
Suggested change:
- * All operations in a queue will be executed sequential in FIFO order.
+ * All operations in a queue will be executed sequentially in FIFO order.
* Operations can be ``TaskKernels``, ``HostTasks``, ``Events``, ``Sets`` and ``Copies``.
* A Queue can be ``Blocking`` (host thread is waiting for finishing the API call) or ``NonBlocking`` (host thread continues after calling the API independent if the call finished or not).
* All operations in a queue will be executed sequential in FIFO order.
* Operations in different queues can run in parallel even on blocking queues.
With blocking queues, this requires using different host tasks, right?
* All operations in a queue will be executed sequential in FIFO order.
* Operations in different queues can run in parallel even on blocking queues.
* ``wait()`` can be executed for queues to block the caller host thread until all previous enqueued work is finished.
* Each ``queue`` is bounded to a specific ``device``.
Suggested change:
- * Each ``queue`` is bounded to a specific ``device``.
+ * Each ``queue`` is bound to a specific ``device``.
Event
-----

* A ``event`` is a marker in a ``queue``.
Suggested change:
- * A ``event`` is a marker in a ``queue``.
+ * An ``event`` is a marker in a ``queue``.
Should it be
@SimeonEhrig can you remove
Ah, never mind, I didn't know I could do it myself.
During my work on PR #2180 I had some trouble adding the memory visibility to the correct concepts. Therefore I had an offline discussion with @psychocoderHPC and started to write down some terms, their meanings and their relationships.
It is pretty hard to document this stuff, so I think my current documentation is incomplete and lacks precision. Please comment on it.