
Use of GPUs #474

Open
hakonanes opened this issue Nov 19, 2021 · 2 comments
Labels
enhancement New feature or request help wanted Would be nice if someone could help

Comments


hakonanes commented Nov 19, 2021

We should try to take advantage of GPUs by writing some GPU kernels. I don't have an NVIDIA GPU available, so my choice would be to use PyOpenCL instead of CuPy.

@drowenhorst-nrl has written some kernels in PyEBSDIndex that we could take inspiration from, e.g. the static background subtraction used in one of the Radon transform functions. Such a kernel could be an alternative to our background subtraction.
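For reference, a minimal NumPy sketch of what such a per-pattern static background subtraction computes on the CPU (the function name and the rescaling step are illustrative, not kikuchipy's or PyEBSDIndex's actual API; a GPU kernel would do the same arithmetic per pattern):

```python
import numpy as np

def subtract_static_background(patterns, background):
    """Subtract one static background from a stack of EBSD patterns.

    `patterns` has shape (n, rows, cols); `background` has shape
    (rows, cols). Each corrected pattern is rescaled to [0, 1].
    """
    corrected = patterns.astype(np.float32) - background.astype(np.float32)
    # Per-pattern min/max, keeping dims so broadcasting works
    mins = corrected.min(axis=(1, 2), keepdims=True)
    maxs = corrected.max(axis=(1, 2), keepdims=True)
    return (corrected - mins) / (maxs - mins)

rng = np.random.default_rng(42)
patterns = rng.random((4, 60, 60)).astype(np.float32)
background = patterns.mean(axis=0)
out = subtract_static_background(patterns, background)
```

The loop over patterns is implicit in the broadcasting here; in an OpenCL kernel it would become the work-item index.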

In general, I think more per-pattern operations in the kikuchipy.pattern module could be replaced by PyOpenCL kernels, like image rescaling. We have CPU acceleration from Numba here, but it would be good to test GPU acceleration as well.
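As a concrete example of such a per-pattern operation, a plain NumPy sketch of intensity rescaling (function name and output range are illustrative, not kikuchipy's actual signature):

```python
import numpy as np

def rescale_intensity(pattern, out_range=(-1.0, 1.0)):
    """Linearly rescale a single pattern's intensities to `out_range`.

    A per-pattern operation of the kind that could be ported from a
    Numba-accelerated CPU loop to a GPU kernel.
    """
    pattern = pattern.astype(np.float32)
    imin, imax = pattern.min(), pattern.max()
    omin, omax = out_range
    return (pattern - imin) / (imax - imin) * (omax - omin) + omin

p = np.array([[0, 128], [192, 255]], dtype=np.uint8)
r = rescale_intensity(p)  # values now span [-1, 1]
```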

Other resources:

@hakonanes hakonanes added the enhancement New feature or request label Nov 19, 2021
@drowenhorst-nrl

Jumping in with a quick note: the Achilles' heel of GPU compute (excluding new platforms like the Apple M1 with really fast integrated graphics) is the time it takes to transfer data to/from the GPU. It may therefore be hard to beat the CPU when dealing with single, relatively small patterns (1k × 1k is probably near the smallest one might consider). In PyEBSDIndex this is mitigated by inherently performing many calculations on a large batch of patterns. However, it may well be worth it; some simple tests would help. Yes, take a look at my kernels. Be aware that I interleave the batches of patterns, so each pattern looks like it has N channels and is W × H in 2D size (vs. being W × H with N slices in a volume). In my case this significantly reduced my global memory fetches within the GPU.
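The interleaving described above can be sketched in NumPy as a transpose from a "N slices" layout to a "N channels per pixel" layout (shapes are illustrative):

```python
import numpy as np

# A batch of N patterns stored as N slices of shape (H, W)
n, h, w = 16, 60, 60
batch = np.arange(n * h * w, dtype=np.float32).reshape(n, h, w)

# Interleave so each (row, col) pixel holds its N channel values
# contiguously in memory. A work-item reading one pixel across all
# patterns then touches one contiguous region, which is what cuts
# down global memory fetches on the GPU.
interleaved = np.ascontiguousarray(batch.transpose(1, 2, 0))  # (H, W, N)
```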

Also look at the gputools package (https://github.com/maweigert/gputools); it might have a lot of what you want. It inspired a lot of my initial efforts.

Final note: the GPU compute landscape is a mess. Yes, OpenCL is currently the most cross-platform framework, but Apple has said that OpenCL is officially deprecated (though not yet removed from the latest OS). CUDA is NVIDIA only. Apple and NVIDIA still hate each other. Windows and OpenCL can be done, but not as easily as on other platforms ...

I think a lot of cross-platform commercial software will rewrite for multiple frameworks/platforms. A lot of the machine learning community says NVIDIA/CUDA or nothing. You might want to take a look at Vulkan/MoltenVK. It is definitely geared more towards rendering than compute, but it might serve your needs well. I have not fully comprehended the interplay between OpenCL and Vulkan, but I think there is something there, which is why I hold out hope that OpenCL can be a long-term solution.


hakonanes commented Nov 22, 2021

Thank you for this valuable input, @drowenhorst-nrl.

> In PyEBSDIndex this is mitigated by inherently performing many calculations on a large batch of patterns.

This could easily be adopted in kikuchipy, I think, since we use Dask to spread the workload across all available CPUs. Dask does this by operating on chunks of the full pattern array. A chunk is typically 100 MB in size, and the signal (detector) axes are never chunked. It would therefore seem to make sense to send whole chunks on to the GPU.
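To get a feel for the batch sizes involved, a quick back-of-the-envelope calculation (the 60 × 60 pattern shape is just an example):

```python
import numpy as np

# How many (60, 60) float32 patterns fit in a ~100 MB Dask chunk?
chunk_nbytes = 100e6
pattern_nbytes = 60 * 60 * np.dtype("float32").itemsize  # 14 400 bytes
patterns_per_chunk = int(chunk_nbytes // pattern_nbytes)
```

Thousands of patterns per chunk should amortize the host-to-device transfer cost that @drowenhorst-nrl warns about above.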

> Be aware that I interleave the batches of patterns, thus each pattern looks like it has N channels, and is WxH in 2D size (vs being WxH with N slices in a volume). In my case this significantly reduced my global memory fetches within the GPU.

You're describing what you're doing in the following, right?

https://github.com/USNavalResearchLaboratory/PyEBSDIndex/blob/d8f3f4df368a7bc416517b266c518e6c2ad586de/pyebsdindex/radon_fast.py#L221-L237

I think I understand what you're doing: you "allocate" a (16 or more patterns, n detector rows, n detector columns) 32-bit floating-point array on the GPU. It is passed to the CL kernel backSub, which subtracts the background one pattern at a time in a loop along the first axis.

Our Dask chunks are usually 4D, so in our case backSub would need two nested loops. Or we could do the same reshaping as you beforehand.
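The reshaping mentioned above amounts to collapsing the two navigation axes so a kernel like backSub only needs one loop; a NumPy sketch with illustrative shapes:

```python
import numpy as np

# A 4D chunk: (nav rows, nav cols, detector rows, detector cols)
chunk = np.zeros((10, 12, 60, 60), dtype=np.float32)

# Collapse the two navigation axes into one, so a kernel can loop
# over a single pattern axis instead of two nested ones.
flat = chunk.reshape(-1, *chunk.shape[-2:])  # (120, 60, 60)
```

For a C-contiguous chunk this reshape is a view, so no data is copied before the transfer to the GPU.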

> Also look at the gputools package. https://github.com/maweigert/gputools they might have a lot of what you want. They inspired a lot of my initial efforts.

Looks like a good reference, and perhaps something we could depend on for some functionality.

@hakonanes hakonanes added the help wanted Would be nice if someone could help label Jul 26, 2023