GPU trsm performance drops with large block sizes #30

Open
abouteiller opened this issue Nov 6, 2020 · 6 comments
Labels: bug (Something isn't working), high priority (This is an important feature)

Comments

@abouteiller
Contributor

Original report by Mikael Simberg (Bitbucket).


I’m comparing the performance of dplasma with other libraries on GPUs, and I’m particularly looking at trsm at the moment. I see performance initially increase with the block size until it reaches a good fraction of the peak flops of the GPU, but after that performance drops significantly. More concretely, I’m running dplasma on a single node with a P100 GPU, 12 core Haswell CPU (Piz Daint GPU partition), built in release mode with GCC 8.3, CUDA 10.2 (I pass no additional options to the CMake configuration except the build type) and get the following results:

$ for block_exp in {6..14}; do block_size=$((2 ** block_exp)); srun -n 1 tests/testing_strsm -c 12 -M 16384 -N 16384 --NB ${block_size} --MB ${block_size} -p 1 -q 1 -g 1; done
[****] TIME(s)    828.03722 : strsm     PxQ=   1 1   NB=   64 N=   16384 :       5.311736 gflops
[****] TIME(s)     14.29047 : strsm     PxQ=   1 1   NB=  128 N=   16384 :     307.779589 gflops
[****] TIME(s)      2.02718 : strsm     PxQ=   1 1   NB=  256 N=   16384 :    2169.675000 gflops
[****] TIME(s)      0.95916 : strsm     PxQ=   1 1   NB=  512 N=   16384 :    4585.610550 gflops
[****] TIME(s)      1.36078 : strsm     PxQ=   1 1   NB= 1024 N=   16384 :    3232.196247 gflops
[****] TIME(s)      2.25861 : strsm     PxQ=   1 1   NB= 2048 N=   16384 :    1947.351262 gflops
[****] TIME(s)      5.96737 : strsm     PxQ=   1 1   NB= 4096 N=   16384 :     737.061370 gflops
[****] TIME(s)     14.53428 : strsm     PxQ=   1 1   NB= 8192 N=   16384 :     302.616577 gflops
[****] TIME(s)     56.22007 : strsm     PxQ=   1 1   NB= 16384 N=   16384 :      78.233892 gflops

Is this expected behaviour? What could explain the drop? Is there something in the configuration that could be causing this? I don’t actually expect to be running with huge block sizes (especially NB == N), but I wouldn’t have expected performance to start dropping already at NB = MB = 1024.

@abouteiller
Contributor Author

This is normal: when N=NB (extreme case), you have absolutely no parallelism.

In general, you want NB in the region of 2k on P100 GPUs, but if you are running a very small problem the best NB may be smaller (a balance between kernel efficiency and algorithm parallelism).
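To make that trade-off concrete, here is a back-of-the-envelope sketch (not DPLASMA code; the task count nt*(nt-1) is only a rough proxy for the actual task graph): with an N x N matrix cut into NB x NB tiles, the update that follows the first panel consists of roughly (N/NB) * (N/NB - 1) independent GEMMs, so larger tiles give more efficient kernels but leave fewer tasks to keep the GPU busy.

```c
/* Rough illustration (not DPLASMA code): how many independent GEMM tasks
 * are available right after the first panel of a blocked TRSM, as a crude
 * proxy for the parallelism the runtime can exploit for a given NB. */
#include <stdio.h>

int main(void)
{
    const int N = 16384;                         /* problem size from the report above */
    for (int nb = 64; nb <= N; nb *= 2) {
        int nt = N / nb;                         /* tiles per row/column */
        long first_update = (long)nt * (nt - 1); /* GEMMs after the first panel */
        printf("NB=%5d  tiles=%4d x %-4d  GEMM tasks after first panel=%ld\n",
               nb, nt, nt, first_update);
    }
    return 0;
}
```

At NB = N there is a single tile, so the update is empty and nothing is left to overlap or offload.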

@abouteiller
Contributor Author

This is normal behavior.

@abouteiller
Contributor Author

Original comment by Mikael Simberg (Bitbucket).


Thanks for the response. Since I don’t know much about how dplasma works internally, could you explain why you say there is no parallelism when NB = N? In my naive view this would just degenerate to a single cublas trsm call, but apparently that’s not the case? Again, I’m not really expecting the best performance at NB = N; I’m just trying to understand what’s causing the performance to start dropping already after NB = 512.

@abouteiller
Contributor Author

The TRSM algorithm in DPLASMA is written in a blocked form. We first do the TRSM on the leftmost column (TRSM on the CPU), then apply the update with GEMM on the remainder of the matrix (GEMM on the GPU). With NB=N, the only call is a single serial TRSM on the CPU.

It would probably make sense for us to revisit this and execute the TRSMs on the GPU as well (in that case it would degenerate into running a single CUBLAS TRSM for N=NB). This was designed when CUBLAS was very slow on most non-GEMM operations; that is not really a problem anymore.
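For illustration only, here is a minimal single-threaded sketch of the blocked structure described above. It is not DPLASMA source (which expresses this as a parameterized task graph with the updates offloaded to the GPU), it uses plain CBLAS, and it covers only one assumed case (side = Right, Upper, NoTrans): each iteration solves one block column with a TRSM, then updates the trailing block columns with independent GEMMs.

```c
/* Schematic sketch (not DPLASMA source): solve X * U = B in place, with U
 * upper triangular, using a right-looking blocked TRSM.  Each iteration does
 * one panel TRSM on a block column of B, then GEMM updates on the trailing
 * block columns.  With NB == N the loop runs once: a single serial TRSM. */
#include <cblas.h>

/* B is M x N (column-major, leading dimension ldb), U is N x N upper
 * triangular (leading dimension ldu), NB is the block size. */
void blocked_trsm_right_upper(int M, int N, int NB,
                              const float *U, int ldu,
                              float *B, int ldb)
{
    for (int k = 0; k < N; k += NB) {
        int kb = (N - k < NB) ? N - k : NB;   /* width of this block column */

        /* Panel: B(:, k:k+kb) <- B(:, k:k+kb) * inv(U(k:k+kb, k:k+kb)) */
        cblas_strsm(CblasColMajor, CblasRight, CblasUpper,
                    CblasNoTrans, CblasNonUnit,
                    M, kb, 1.0f,
                    &U[(size_t)k * ldu + k], ldu,
                    &B[(size_t)k * ldb], ldb);

        /* Trailing update: B(:, j) -= B(:, k) * U(k, j) for the remaining
         * block columns; these GEMMs are independent of each other, which is
         * where the parallelism (and the GPU work) comes from. */
        for (int j = k + kb; j < N; j += NB) {
            int jb = (N - j < NB) ? N - j : NB;
            cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        M, jb, kb, -1.0f,
                        &B[(size_t)k * ldb], ldb,
                        &U[(size_t)j * ldu + k], ldu,
                        1.0f, &B[(size_t)j * ldb], ldb);
        }
    }
}
```

In the tiled (DPLASMA-style) version the panel itself also splits into M/MB independent tile TRSMs, which is part of why running the TRSM tasks on the GPU as well would help for large NB.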

@abouteiller
Contributor Author

Reopening because the behavior is 'as expected' but sub-optimal, and there is no real reason not to call CUBLAS in all cases anymore.

@abouteiller
Contributor Author

Original comment by Mikael Simberg (Bitbucket).


Thanks for the explanation; the performance drop makes a lot more sense if it ends up running on the CPU!
