
[GPU/OpenCL] Initial version of FC Layer with OpenCL ops #2570

Merged · 6 commits merged into nnstreamer:main on May 23, 2024

Conversation

@s-debadri (Contributor) commented May 7, 2024

Added initial version of FC Layer for GPU. This is a basic implementation using naive kernels.
Changes added with this PR:

  • fc_layer_cl.cpp added, containing the new FullyConnectedLayerCl class for the OpenCL implementation.
  • Modified registerFactory in cl_context to add FullyConnectedLayerCl.
  • Re-used the FullyConnected API in layer.h for CPU/GPU execution depending on the compute engine option.
  • Created common BLAS OpenCL kernels for use inside blas_kernels to improve reusability.
  • Updated the LayerKernel enum inside layer_context.h.
  • Added unittest_layers_fully_connected_cl.cpp to test the FC Layer on GPU.
  • Added a unit test for incremental forwarding in layers_golden_tests.cpp.

To do: modify data transfer by adding cl_buffer with var_grad, and optimize the kernels.

Self evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Debadri Samaddar [email protected]

@taos-ci (Collaborator) commented May 7, 2024

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2570. Please follow a 1 commit/1 PR (one commit per PR) policy to get comments from reviewers quickly. Your PR must pass all verification processes of cibot before a review by the reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and the wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.

/**
* @brief Helper function to create fully connected layer for GPU
*/
inline std::unique_ptr<Layer> FullyConnectedCl(
Collaborator: Not for this PR, but how about creating the CL layer with the same interface, with device options like CPU/GPU?

@s-debadri (Author) replied:
createLayer(const LayerType &type,
            const std::vector<std::string> &properties = {},
            const LayerComputeEngine &compute_engine = LayerComputeEngine::CPU);

This will set lnode->setComputeEngine(compute_engine), which can then be read inside layer classes through the RunLayerContext reference. We can use that to branch within a single layer class instead of having separate classes for CPU and GPU.
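For illustration, a minimal sketch of a call site under that scheme; the LayerType value and the property strings below are assumptions for the example, not taken from this PR:

// Hypothetical call site (names assumed): request the FC layer on the GPU
// through the shared createLayer() interface instead of FullyConnectedCl().
auto fc_gpu = ml::train::createLayer(
  LayerType::LAYER_FC,                 // same layer type for CPU and GPU
  {"unit=10", "input_shape=1:1:32"},   // example properties
  ml::train::LayerComputeEngine::GPU); // engine option selects the CL path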

@@ -0,0 +1,616 @@
/**
@jijoongmoon (Collaborator) commented May 7, 2024: It might be better to use SPDX-License-Identifier: Apache-2.0.

@s-debadri (Author): Updated commit.

} while (false);
}

void FullyConnectedLayerCl::incremental_forwarding(RunLayerContext &context,
@jijoongmoon (Collaborator) commented May 7, 2024: It seems incremental_forwarding does not use the GPU, which worries me a little. We currently use incremental_forwarding for the LLaMA application due to the auto-regressive nature of LLMs. Could you check once again?

@s-debadri (Author): Added a GPU kernel call for incremental_forwarding as well in the updated commit.

return cl_ret;
}

void FullyConnectedLayerCl::fc_sgemm_cl(const float *A, const float *B,
Collaborator: I wonder if we can use these BLAS CL operations in other layers?

@s-debadri (Author): Added in the latest commit.

#include <node_exporter.h>
#include <util_func.h>

std::string fc_sgemv_cl_kernel_ =
Collaborator: Can we make a binary CL kernel to reduce online compilation time?

@s-debadri (Author): This is already present in the current code: the program.CreateCLProgram call from RunLayerContext::clCreateKernel creates kernel binaries on the first run and re-uses them on subsequent runs.

Collaborator: We might need to set a proper directory to save the CL kernel binary then.

@s-debadri (Author): Right. As of now it is set by the opencl-kernel-path meson option.
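For context, a standalone sketch of the compile-once/reload pattern described in this thread, written against the raw OpenCL C API. nntrainer wraps this inside program.CreateCLProgram and RunLayerContext::clCreateKernel, so the helper names below are illustrative assumptions, not the project's actual functions:

#include <CL/cl.h>
#include <fstream>
#include <vector>

// First run: after clBuildProgram() compiles the source, persist the binary.
void save_program_binary(cl_program program, const char *path) {
  size_t bin_size = 0;
  clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size),
                   &bin_size, nullptr);
  std::vector<unsigned char> binary(bin_size);
  unsigned char *buf = binary.data();
  clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(buf), &buf, nullptr);
  std::ofstream(path, std::ios::binary)
    .write(reinterpret_cast<const char *>(binary.data()), bin_size);
}

// Subsequent runs: skip online compilation by loading the saved binary.
cl_program load_program_binary(cl_context ctx, cl_device_id dev,
                               const std::vector<unsigned char> &binary) {
  const unsigned char *buf = binary.data();
  size_t size = binary.size();
  cl_int bin_status = CL_SUCCESS, err = CL_SUCCESS;
  cl_program prog =
    clCreateProgramWithBinary(ctx, 1, &dev, &size, &buf, &bin_status, &err);
  clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr); // still required
  return prog;
}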

@taos-ci (Collaborator): @s-debadri, 💯 All CI checkers are successfully verified. Thanks.

Added naive version of OpenCL implementation for FC Layer.
Incorporated separate kernels for the ops used.
Added unit test for fc_layer_cl.

Signed-off-by: Debadri Samaddar <[email protected]>
@taos-ci (Collaborator): @s-debadri, 💯 All CI checkers are successfully verified. Thanks.

Added incremental forwarding as an option for unit testing layers

Signed-off-by: Debadri Samaddar <[email protected]>
@taos-ci (Collaborator): @s-debadri, 💯 All CI checkers are successfully verified. Thanks.

Added blas_kernels to enhance reusability of the common BLAS kernels.
Used FullyConnected interface for both CPU and GPU calls.

Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri changed the title from "[WIP] [GPU/OpenCL] Initial version of FC Layer with OpenCL ops" to "[GPU/OpenCL] Initial version of FC Layer with OpenCL ops" on May 14, 2024
@taos-ci (Collaborator): @s-debadri, 💯 All CI checkers are successfully verified. Thanks.

@jijoongmoon (Collaborator) left a review: LGTM!

Renamed global variables in unittest_layers_fully_connected_cl.cpp to fix duplicate declaration error

Signed-off-by: Debadri Samaddar <[email protected]>
@taos-ci (Collaborator): @s-debadri, 💯 All CI checkers are successfully verified. Thanks.

__global float* Y, unsigned int M, unsigned int N) {
unsigned int i, j;
i = get_global_id(0);
float y0 = Y[i] * 0.0f;
Member: float y0 = 0.0f; ?

@s-debadri (Author): Updated in the latest commit. Thanks.
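For reference, a sketch of the fixed kernel; only the signature and the y0 line appear in this excerpt, so the loop body (row-major A, one work-item per output row) is an assumption:

std::string sgemv_cl_kernel_ =
  R"(__kernel void sgemv_cl(const __global float* A, const __global float* X,
                            __global float* Y, unsigned int M, unsigned int N) {
    unsigned int i = get_global_id(0); // one work-item per output row
    float y0 = 0.0f;                   // fix: plain zero, not Y[i] * 0.0f
    for (unsigned int j = 0; j < N; ++j)
      y0 += A[i * N + j] * X[j];
    Y[i] = y0;
  })";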

})";

std::string dot_cl_kernel_ =
R"(__kernel void dot_cl(const __global float* A, const __global float* X, unsigned int K, float res) {
Member: float &res ?

@s-debadri (Author): Thanks for noticing; the result was not being fetched properly. The suggested syntax won't work, though, since OpenCL kernel arguments cannot reference host memory. Used a global memory pointer to fix the issue.
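A sketch of the shape of that fix: res becomes a single-element __global buffer that the host reads back after the kernel completes, rather than a by-value float that is lost on return (the reduction body is assumed, since the excerpt only shows the signature):

std::string dot_cl_kernel_ =
  R"(__kernel void dot_cl(const __global float* A, const __global float* X,
                          unsigned int K, __global float* res) {
    // Naive single-work-item reduction; the host enqueues a buffer read
    // on res afterwards to fetch the scalar result.
    float sum = 0.0f;
    for (unsigned int i = 0; i < K; ++i)
      sum += A[i] * X[i];
    *res = sum;
  })";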

unsigned int m = get_global_id(0);
for (unsigned int n = 0; n < N; ++n) {
float c = 0.0;
float c_old = C[m * ldc + n];
Contributor: This seems unused.

Suggested change
- float c_old = C[m * ldc + n];

@s-debadri (Author): Removed. Thanks.

* @param[in] M number of op(A)'s and C's row
* @param[in] N number of op(B)'s and C's columns
* @param[in] K number of op(A)'s columns and op(B)'s rows
* @param[in] context RunLayerContext reference
Contributor: Could you add descriptions for lda, ldb, and ldc as well?

@s-debadri (Author) commented May 16, 2024: Added in the latest commit.
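A sketch of what the added descriptions might look like, following the usual BLAS convention for leading dimensions (the exact wording in the merged commit may differ):

 * @param[in] lda leading dimension of A (stride, in elements, between rows)
 * @param[in] ldb leading dimension of B (stride between rows of B)
 * @param[in] ldc leading dimension of C (stride between rows of C)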

@taos-ci (Collaborator): @s-debadri, 💯 All CI checkers are successfully verified. Thanks.

Fixed kernel argument bug for dot_cl kernel

Signed-off-by: Debadri Samaddar <[email protected]>
@taos-ci (Collaborator): @s-debadri, 💯 All CI checkers are successfully verified. Thanks.

Comment on lines 36 to 37
* @param[in] dim1 number of A's row
* @param[in] dim2 number of X's columns
Contributor: Following your code, it seems A's dim = (dim2 x dim1), X's = (dim1,), and Y's = (dim2,) if the vectors are column vectors. Please check.

Suggested change
- * @param[in] dim1 number of A's row
- * @param[in] dim2 number of X's columns
+ * @param[in] dim1 number of A's columns
+ * @param[in] dim2 number of A's rows

@s-debadri (Author): Corrected. Thanks.

* @param[in] dim1 number of elements in both input vectors
* @param[in] context RunLayerContext reference
*/
float dot_cl(const float *matAdata, const float *vecXdata, unsigned int dim1,
Contributor: It seems better to change the variable name:

Suggested change
- float dot_cl(const float *matAdata, const float *vecXdata, unsigned int dim1,
+ float dot_cl(const float *vecAdata, const float *vecXdata, unsigned int dim1,

__global float* C, unsigned int M, unsigned int N, unsigned int K, unsigned int lda, unsigned int ldb, unsigned int ldc) {

unsigned int m = get_global_id(0);
for (unsigned int n = 0; n < N; ++n) {
Contributor: Quick question: is there room for optimization as follows, or would it make no difference?

Suggested change
- for (unsigned int n = 0; n < N; ++n) {
+ unsigned int n = get_global_id(1);

@s-debadri (Author): Yes, it can be optimized with a 2D global work size; I have added this change. However, a 2D work size can sometimes introduce unnecessary overhead and reduce kernel performance, depending on the memory access pattern. Going forward we can experiment with large models and pick the optimal work size based on measured performance.
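A sketch of the resulting 2D-indexed kernel; the A/B arguments and the accumulation loop are inferred from the signature quoted above, and the host must then enqueue the kernel with a {M, N} global work size instead of {M}:

std::string sgemm_cl_kernel_ =
  R"(__kernel void sgemm_cl(const __global float* A, const __global float* B,
                            __global float* C, unsigned int M, unsigned int N,
                            unsigned int K, unsigned int lda, unsigned int ldb,
                            unsigned int ldc) {
    unsigned int m = get_global_id(0); // row index
    unsigned int n = get_global_id(1); // column index (was an inner loop)
    float c = 0.0f;
    for (unsigned int k = 0; k < K; ++k)
      c += A[m * lda + k] * B[k * ldb + n];
    C[m * ldc + n] = c;
  })";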


size_t dim1_size = sizeof(float) * dim1;
size_t dim2_size = sizeof(float) * dim2;
opencl::Buffer inputA(context.context_inst_, dim1_size * dim2_size, true,
Contributor: inputA's size would be dim1 * dim2 * sizeof(float) in general. What makes OpenCL require the size to be dim1 * dim2 * sizeof(float) * sizeof(float)?

@s-debadri (Author): Fixed this bug. Thanks for noticing.
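For clarity, the corrected byte count (a sketch; the Buffer constructor's trailing arguments are truncated in this excerpt, so only the size expression is shown):

// A dim1 x dim2 float matrix occupies dim1 * dim2 * sizeof(float) bytes.
// The old expression dim1_size * dim2_size included sizeof(float) twice,
// inflating the allocation by a factor of sizeof(float).
size_t a_bytes = dim1 * dim2 * sizeof(float);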

Used proper size while creating OpenCL buffers.
Optimized SGEMM kernel with 2D global work size.
Modified function docs.

Signed-off-by: Debadri Samaddar <[email protected]>
@taos-ci (Collaborator): @s-debadri, 💯 All CI checkers are successfully verified. Thanks.

@jijoongmoon (Collaborator) left a review: LGTM

@EunjuYang (Contributor) left a review: LGTM!

std::string sgemv_cl_kernel_ =
R"(__kernel void sgemv_cl(const __global float* A, const __global float* X,
__global float* Y, unsigned int M, unsigned int N) {
unsigned int i;
Member: Just a curious, trivial question, but why don't we use lda here while sgemm_cl_kernel uses lda and ldb? I can even observe lda in:

// @ same file
...
void sgemv_cl(const float *matAdata, const float *vecXdata, float *vecYdata,
              unsigned int dim1, unsigned int dim2, unsigned int lda,
              RunLayerContext &context);
...

Adding params like lda and ldb would be much more helpful for future development, as far as I am concerned.

@s-debadri (Author): Sure, this can be added in the future to make it more generic.
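A sketch of the more generic variant the comment asks for, with lda as the row stride so A can be a view into a larger matrix (the loop body is assumed, as above):

std::string sgemv_cl_kernel_ =
  R"(__kernel void sgemv_cl(const __global float* A, const __global float* X,
                            __global float* Y, unsigned int M, unsigned int N,
                            unsigned int lda) {
    unsigned int i = get_global_id(0);
    float y0 = 0.0f;
    for (unsigned int j = 0; j < N; ++j)
      y0 += A[i * lda + j] * X[j]; // lda decouples A's storage width from N
    Y[i] = y0;
  })";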

@jijoongmoon merged commit ddf8104 into nnstreamer:main on May 23, 2024 · 36 checks passed

@s-debadri deleted the gpu_fc branch on May 23, 2024 at 07:40