
Document different alpaka terms and their relationships #2275

Open

wants to merge 5 commits into base: develop

Conversation

SimeonEhrig
Member

During my work on PR #2180 I had some trouble adding the memory visibility to the correct concepts. Therefore I had an offline discussion with @psychocoderHPC and started to write down some terms, their meanings, and their relationships.

It is pretty hard to document this stuff, so I think my current documentation is incomplete and lacks precise descriptions. Please comment on it.

Accelerator
-----------

- An ``accelerator`` is an index mapping function. It distributes the index space into chunks: the ``accelerator`` maps a contiguous index space onto a blocked index domain decomposition.
Contributor

It is much more than that.
For example, the OpenMP and TBB accelerators implement different kinds of parallelism for the Cpu device.

Member Author

Yes, I know. Accelerator is pretty hard to explain. Actually I only wrote down what @psychocoderHPC told me, because my personal description is too imprecise.

Member Author

I would like to discuss the documentation of the accelerator in the next alpaka VC. For me, this is a complicated subject.
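Since the accelerator is described above as an index mapping function, a plain standard C++ sketch of the blocked domain decomposition may help. This is not alpaka API; all names here are made up for illustration:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch only (not alpaka API): a contiguous index
// space [0, n) is decomposed into blocks of threads, with each
// thread handling `elemsPerThread` consecutive elements -- the kind
// of mapping an accelerator performs.
std::vector<std::size_t> blockedDecomposition(std::size_t n,
                                              std::size_t threadsPerBlock,
                                              std::size_t elemsPerThread)
{
    std::vector<std::size_t> order;
    std::size_t const chunk = threadsPerBlock * elemsPerThread;
    std::size_t const numBlocks = (n + chunk - 1) / chunk; // round up
    for(std::size_t block = 0; block < numBlocks; ++block)
        for(std::size_t thread = 0; thread < threadsPerBlock; ++thread)
            for(std::size_t elem = 0; elem < elemsPerThread; ++elem)
            {
                // global index of this (block, thread, element) triple
                std::size_t const i
                    = (block * threadsPerBlock + thread) * elemsPerThread + elem;
                if(i < n)
                    order.push_back(i); // guard against the partial last chunk
            }
    return order;
}
```

Every index in [0, n) is visited exactly once, even when n is not a multiple of the chunk size.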

Queue
-----

- Stores operations which should be executed on a ``device``.
- Operations can be ``TaskKernels``, ``Events``, ``Sets`` and ``Copies``.
Contributor

Also host tasks (that are not kernels).

Contributor

Memory allocations can also be submitted to a queue.

Member Author

@SimeonEhrig SimeonEhrig May 21, 2024

Is a host task also a trait/concept in alpaka? I have never heard of this.

Contributor

@fwyzard fwyzard May 21, 2024

Mhm, I'm not sure about a concept.

I think you can submit anything that can be called with no arguments:

#include <chrono>
#include <iostream>
#include <thread>  // for std::this_thread::sleep_for
using namespace std::chrono_literals;

#include <alpaka/alpaka.hpp>

int main(void) {

#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
  using Platform = alpaka::PlatformCudaRt;
#else
  using Platform = alpaka::PlatformCpu;
#endif  // ALPAKA_ACC_GPU_CUDA_ENABLED
  Platform platform;

  // use the first available device
  using Device = alpaka::Dev<Platform>;
  Device device = alpaka::getDevByIdx(platform, 0u);

  // create an asynchronous queue
  using Queue = alpaka::Queue<Device, alpaka::NonBlocking>;
  Queue queue{device};

  std::cout << "before submitting\n";

  // submit a host task to an asynchronous queue
  alpaka::enqueue(queue, [](){
    // asynchronous host task executed from a device queue
    std::cout << "async hello world from " << alpaka::core::demangled<Device> << '\n';
  });

  // sleep for 1s
  std::this_thread::sleep_for(1000ms);

  std::cout << "after submitting, before waiting\n";

  // wait for the asynchronous task to complete
  alpaka::wait(queue);

  std::cout << "after waiting\n";

  return 0;
}

Compiled with

$ g++ -std=c++17 -O2 -DALPAKA_ACC_GPU_CUDA_ENABLED -DALPAKA_HOST_ONLY -I/home/fwyzard/src/alpaka-group/alpaka/include -I/usr/local/cuda/include test.cc -L/usr/local/cuda/lib64 -lcudart -lcuda -lrt -o test

should print

$ ./test 
before submitting
async hello world from alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt>
after submitting, before waiting
after waiting

Compiled with

$ g++ -std=c++17 -O2 -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED -I/home/fwyzard/src/alpaka-group/alpaka/include test.cc -L/usr/local/cuda/lib64 -lrt -o test

should print

$ ./test 
before submitting
async hello world from alpaka::DevCpu
after submitting, before waiting
after waiting

Member Author

I was not aware that this is possible. I used a task kernel and a CPU device for it. But I found the feature in the queue tests.

I will add it. Looks like my documentation makes sense ;-)

Contributor

I used a task kernel and a CPU device for it.

That also works.

The point of using a host task with a device queue is to run the host code only after the device operations are complete.

A similar functionality in CUDA is provided by cudaLaunchHostFunc, but the alpaka version does not have the limitation that it cannot call other device API functions.

Member Author

Okay, I see why the functionality makes sense.


Queue
-----

Contributor

I would add that a queue can be either "synchronous" (blocking) or "asynchronous" (non-blocking), and that this may determine how the work is submitted to the device, and if a synchronisation is performed after each operation.
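To illustrate the blocking vs. non-blocking distinction without any alpaka code, here is a plain C++ sketch; `ToyQueue` and its methods are made up for illustration and are not the alpaka API. A blocking submit runs the task before returning, while a non-blocking submit returns immediately and the work is only guaranteed to be done after `wait()`:

```cpp
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Illustrative sketch only (not alpaka API): a toy queue that can
// submit work either blocking (run it now, caller waits) or
// non-blocking (run it asynchronously, synchronise later in wait()).
class ToyQueue
{
public:
    explicit ToyQueue(bool blocking) : m_blocking(blocking) {}

    void enqueue(std::function<void()> task)
    {
        if(m_blocking)
            task(); // the caller waits until the task has finished
        else
            m_pending.push_back(std::async(std::launch::async, std::move(task)));
    }

    void wait()
    {
        for(auto& f : m_pending)
            f.get(); // block until all submitted work has finished
        m_pending.clear();
    }

private:
    bool m_blocking;
    std::vector<std::future<void>> m_pending;
};
```

With a blocking `ToyQueue`, a counter incremented by the task is already updated when `enqueue` returns; with a non-blocking one, that is only guaranteed after `wait()`.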

@mehmetyusufoglu
Contributor

The attached image is great, but don't we need to show a relation from accelerator to device, since in the code we use the accelerator type, not the platform, to get the device?


- A ``device`` represents a compute unit, such as a CPU or a GPU.
- Each ``device`` is bound to a specific ``platform``.
- Each ``device`` can have any number of ``queues``.
Contributor

@mehmetyusufoglu mehmetyusufoglu May 23, 2024

A device can be attached to data (in an alpaka buf or view), which makes memcpy calls easy.

Member Author

There is also a relation between accelerator and TaskKernel, which I didn't add. And I'm not sure if I should add it in this diagram or in the second one. Same for your suggestion.

- Stores operations which should be executed on a ``device``.
- Operations can be ``TaskKernels``, ``HostTasks``, ``Events``, ``Sets`` and ``Copies``.
- Each ``queue`` is bounded to a specific ``device``.
- A Queue can be ``Blocking`` (host thread is waiting for finishing the API call) or ``NonBlocking`` (host thread continues after calling the API independent if the call finished or not).
Contributor

@mehmetyusufoglu mehmetyusufoglu May 23, 2024

[nit] Blocking queues only block the caller thread of exec or enqueue; they do not block other queues.

- Operations can be ``TaskKernels``, ``HostTasks``, ``Events``, ``Sets`` and ``Copies``.
- Each ``queue`` is bounded to a specific ``device``.
- A Queue can be ``Blocking`` (host thread is waiting for finishing the API call) or ``NonBlocking`` (host thread continues after calling the API independent if the call finished or not).
- All operations in a queue will be executed sequentiell.
Contributor

Suggested change
- All operations in a queue will be executed sequentiell.
- All operations in a queue will be executed sequentially.

Set
---

- A ``Set`` set byte wise a memory to a specific value.
Contributor

Suggested change
- A ``Set`` set byte wise a memory to a specific value.
- A ``Set`` sets a memory to a specific value byte-wise.
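The byte-wise semantics matter: filling multi-byte elements with a byte value does not set each element to that value. A plain standard C++ illustration using `std::memset` (not the alpaka memset API):

```cpp
#include <cstdint>
#include <cstring>

// Illustration of a byte-wise fill: writing the byte 0x01 across a
// uint32_t array makes each element 0x01010101, not 1.
std::uint32_t byteWiseFillDemo()
{
    std::uint32_t buf[4];
    std::memset(buf, 0x01, sizeof(buf)); // fills every byte with 0x01
    return buf[0];
}
```

This is why a Set is typically used with the value 0, or with buffers of byte-sized elements.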

@fwyzard
Contributor

fwyzard commented May 23, 2024

in the code we use the accelerator type, not the platform, to get the device?

This is actually the wrong way of using accelerators, platforms and devices.

@SimeonEhrig
Member Author

in the code we use the accelerator type, not the platform, to get the device?

This is actually the wrong way of using accelerators, platforms and devices.

This is actually one of the motivations for this documentation. Our examples are written in an accelerator-centered way, but this is actually not correct. And the idea that the accelerator is the central element causes some problems if you want to implement new features.

@fwyzard
Contributor

fwyzard commented May 23, 2024

Our examples are written in an accelerator-centered way, but this is actually not correct.

It does make sense to write the unit tests like that: we want to test the accelerator, so we need to use the corresponding device.

But in general, it makes more sense for an application to start from the platform, enumerate the available devices, pick one, create a queue, and choose the accelerator that best matches that device and the problem at hand.
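That recommended flow could look roughly like the following sketch. It is untested and hedged: it assumes the alpaka API shown elsewhere in this thread (``PlatformCpu``, ``getDevByIdx``, ``Queue``) plus ``getDevCount`` and ``AccCpuSerial``, and is not meant as the authoritative pattern:

```cpp
#include <alpaka/alpaka.hpp>
#include <cstddef>

int main()
{
    // 1. start from a platform
    alpaka::PlatformCpu platform;

    // 2. enumerate the available devices and pick one
    auto const devCount = alpaka::getDevCount(platform); // pick any of these
    auto device = alpaka::getDevByIdx(platform, 0u);

    // 3. create a queue on that device
    using Device = alpaka::Dev<alpaka::PlatformCpu>;
    alpaka::Queue<Device, alpaka::NonBlocking> queue{device};

    // 4. choose the accelerator that best matches the device and problem
    using Acc = alpaka::AccCpuSerial<alpaka::DimInt<1u>, std::size_t>;
    // ... create a work division and enqueue kernels for Acc ...
}
```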

@SimeonEhrig
Member Author

@alpaka-group/alpaka-maintainers I'm not able to finish the PR because I do not really understand the alpaka concepts. Somebody can take over the PR. If the PR is not finished by the next release, it will be closed without merging.

@psychocoderHPC psychocoderHPC marked this pull request as ready for review June 10, 2024 08:49
@psychocoderHPC
Member

@alpaka-group/alpaka-maintainers Since @SimeonEhrig asked for help I changed some points which IMO are important.
My goal is not to have the perfect description because what @SimeonEhrig did is already a huge improvement compared to the current state.

Accelerator
-----------

- Describes "how" a kernel work division is mapped to device threads.
Contributor

We give the work division separately to the kernel; we don't use the workdiv from the acc?

Member

The user gives the work division to the kernel exec call, but in the end this is only the interface. I do not like describing the interface in words; my intention is to describe what the different concepts do.
In this case the accelerator performs the mapping from indices to physical threads.

Work division
-------------

- Describes the domain decomposition of a contiguous N-dimensional index domain in ``blocks``, ``threads`` within a ``block``, and ``elements`` per ``thread``.
Member Author

Suggested change
- Describes the domain decomposition of a contiguous N-dimensional index domain in ``blocks``, ``threads`` within a ``block``, and ``elements`` per ``thread``.
- Describes the domain decomposition of a contiguous N-dimensional index domain in ``blocks``, ``threads`` and ``elements``. A ``block`` contains one or more ``threads`` and a ``thread`` one or more ``elements``.

Member

fixed

- N-dimensional work divisions (1D, 2D, 3D) are supported.
- Holds implementations of shared memory, atomic operations, math operations etc.
- ``Accelerators`` are instantiated only when a kernel is executed, and can only be accessed in device code.
- Each device function can (should) be templated on the accelerator type, and take an accelerator as its first argument.
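The last bullet describes a pattern that can be shown without alpaka at all. In this plain C++ sketch, `FakeAcc`, `globalThreadIdx` and `kernelBody` are made-up stand-ins for illustration; with alpaka, the accelerator type would be a real `TAcc` and the index would come from e.g. `alpaka::getIdx`:

```cpp
#include <cstddef>

// Stand-in for a real alpaka accelerator (illustration only).
struct FakeAcc
{
    std::size_t threadIdx = 0;
};

// Device functions are templated on the accelerator type and take
// the accelerator as their first argument, so the same code can be
// instantiated for any accelerator.
template<typename TAcc>
std::size_t globalThreadIdx(TAcc const& acc)
{
    // with alpaka this would be e.g. alpaka::getIdx<...>(acc)
    return acc.threadIdx;
}

template<typename TAcc>
void kernelBody(TAcc const& acc, std::size_t* out)
{
    // each "thread" writes its own global index into the output
    out[globalThreadIdx(acc)] = globalThreadIdx(acc);
}
```

Because the accelerator is a template parameter, the kernel compiles unchanged for the serial CPU, OpenMP, or GPU back-ends.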
Member Author

Looks like the indentation is not correct. It does not render as sub-items.

Member

fixed

@psychocoderHPC
Member

@alpaka-group/alpaka-maintainers ping

@SimeonEhrig
Member Author

@psychocoderHPC I can only approve it via text because I'm the author of the PR.

Work division
-------------

* Describes the domain decomposition of a contiguous N-dimensional index domain in ``blocks``, ``threads`` and ``elements``. A ``block`` contains one or more ``threads`` and a ``thread`` process one or more ``elements``.
Contributor

Suggested change
* Describes the domain decomposition of a contiguous N-dimensional index domain in ``blocks``, ``threads`` and ``elements``. A ``block`` contains one or more ``threads`` and a ``thread`` process one or more ``elements``.
* Describes the domain decomposition of a contiguous N-dimensional index domain in ``blocks``, ``threads`` and ``elements``. A ``block`` contains one or more ``threads`` and each ``thread`` processes one or more ``elements``.


* Stores tasks which should be executed on a ``device``.
* Operations can be ``TaskKernels``, ``HostTasks``, ``Events``, ``Sets`` and ``Copies``.
* A Queue can be ``Blocking`` (host thread is waiting for finishing the API call) or ``NonBlocking`` (host thread continues after calling the API independent if the call finished or not).
Contributor

Suggested change
* A Queue can be ``Blocking`` (host thread is waiting for finishing the API call) or ``NonBlocking`` (host thread continues after calling the API independent if the call finished or not).
* A Queue can be ``Blocking`` (the host thread waits for each API call or task to finish) or ``NonBlocking`` (the host thread continues after making an API call or submitting a task without waiting for it to finish, or even start).

* Stores tasks which should be executed on a ``device``.
* Operations can be ``TaskKernels``, ``HostTasks``, ``Events``, ``Sets`` and ``Copies``.
* A Queue can be ``Blocking`` (host thread is waiting for finishing the API call) or ``NonBlocking`` (host thread continues after calling the API independent if the call finished or not).
* All operations in a queue will be executed sequential in FIFO order.
Contributor

Suggested change
* All operations in a queue will be executed sequential in FIFO order.
* All operations in a queue will be executed sequentially in FIFO order.

* Operations can be ``TaskKernels``, ``HostTasks``, ``Events``, ``Sets`` and ``Copies``.
* A Queue can be ``Blocking`` (host thread is waiting for finishing the API call) or ``NonBlocking`` (host thread continues after calling the API independent if the call finished or not).
* All operations in a queue will be executed sequential in FIFO order.
* Operations in different queues can run in parallel even on blocking queues.
Contributor

With blocking queues, this requires submitting from different host threads, right?

* All operations in a queue will be executed sequential in FIFO order.
* Operations in different queues can run in parallel even on blocking queues.
* ``wait()`` can be executed for queues to block the caller host thread until all previously enqueued work is finished.
* Each ``queue`` is bounded to a specific ``device``.
Contributor

Suggested change
* Each ``queue`` is bounded to a specific ``device``.
* Each ``queue`` is bound to a specific ``device``.

Event
-----

* A ``event`` is a marker in a ``queue``.
Contributor

Suggested change
* A ``event`` is a marker in a ``queue``.
* An ``event`` is a marker in a ``queue``.

@fwyzard
Contributor

fwyzard commented Jun 25, 2024

Should it be Mem Set and Mem Copy instead of just Set and Copy?

@fwyzard
Contributor

fwyzard commented Jun 25, 2024

@SimeonEhrig can you remove [RFC] from the title of the PR?

@fwyzard fwyzard changed the title [RFC] document different alpaka terms and it relationships Document different alpaka terms and it relationships Jun 25, 2024
@fwyzard
Contributor

fwyzard commented Jun 25, 2024

Ah, no matter, I didn't know I could do it myself.
