[ hgemm ] Consider K=1 changes #2654

Merged
merged 3 commits into from
Jun 28, 2024

skykongkong8 (Member) commented Jun 28, 2024

  • The current implementation targets the general case and therefore optimizes only with respect to K accumulation.
  • However, for an (M,1) x (1,N) computation, optimizations such as packing and transposing are of no use.
  • Implementing an explicit kernel function for this case resolved the latency issue.

dim = (576, 1) x (1, 1024)

| Case    | fp16      | fp32      |
|---------|-----------|-----------|
| noTrans | 190834 ns | 380525 ns |
| transA  | 173896 ns | 387860 ns |
| transB  | 180369 ns | 382123 ns |
| transAB | 179263 ns | 379238 ns |

Because this is the K=1 case, there is no need for partial accumulation along the K direction in fp32, so fp16 runs roughly twice as fast as fp32 (approximately 200%) in the measurements above.
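
To make the K=1 argument concrete, here is a minimal scalar sketch of such a kernel. It is not the code from this PR: the function name `gemm_k1_sketch` is hypothetical, it uses `float` instead of fp16 so that it compiles on any toolchain, and it assumes contiguous row-major storage. The point it illustrates is that every output element is a single product, so there is nothing to accumulate along K and nothing to pack.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not the nntrainer kernel:
// C = alpha * A * B + beta * C for A of shape (M,1) and B of shape (1,N),
// all stored row-major and contiguous.
// With K=1 each output element is one product, so there is no K-direction
// accumulation, no packing, and no transposition.
inline void gemm_k1_sketch(const float *A, const float *B, float *C,
                           std::size_t M, std::size_t N,
                           float alpha = 1.0f, float beta = 0.0f) {
  assert(A != nullptr && B != nullptr && C != nullptr);
  for (std::size_t m = 0; m < M; ++m) {
    const float a = alpha * A[m]; // hoist the row scalar out of the inner loop
    float *c_row = C + m * N;
    for (std::size_t n = 0; n < N; ++n) {
      // beta == 0 overwrites C, so stale values in C are never read
      c_row[n] = (beta == 0.0f) ? a * B[n] : a * B[n] + beta * c_row[n];
    }
  }
}
```

The inner loop is a scaled copy of `B` into one row of `C`, which is the shape of work that vectorizes well without any data rearrangement.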

Self evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

taos-ci (Collaborator) commented Jun 28, 2024

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2654. Please follow the 1commit/1PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before reviewers start the review process. If you are a new member of this project, please read the manuals in the documentation folder and the wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.

skykongkong8 force-pushed the pr/hgemm/k1case branch 3 times, most recently from 1ecd0d7 to a2d0536, June 28, 2024 02:09

taos-ci (Collaborator) left a comment

@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

jijoongmoon (Collaborator) left a comment

LGTM

- The current implementation targets the general case and therefore optimizes only with respect to K accumulation.
- However, for an (M,1) x (1,N) computation, optimizations such as packing and transposing are of no use.
- Implementing an explicit kernel function for this case resolved the latency issue.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- To cover transposed cases such as (1,M).T * (1,N) and all other transpose combinations, transpose with SIMD and then apply the original kernel.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
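
The "transpose first, then reuse the non-transposed kernel" approach described in the commit above can be sketched as follows. This is a scalar stand-in for illustration only: the commit uses SIMD for the transpose, and the name `transpose_sketch`, the `float` element type, and the row-major layout are assumptions rather than details taken from the PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar stand-in for the SIMD transpose mentioned above.
// Writes the (cols x rows) transpose of a row-major (rows x cols) matrix
// into dst. The result can then be handed to the non-transposed kernel.
inline void transpose_sketch(const float *src, float *dst,
                             std::size_t rows, std::size_t cols) {
  assert(src != nullptr && dst != nullptr);
  assert(src != dst); // out-of-place only
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      dst[c * rows + r] = src[r * cols + c];
    }
  }
}
```

Handling all transpose combinations this way keeps a single compute kernel, at the cost of one extra pass over the transposed operand.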
…8_case

- Since the K=1 GEMM does not use data packing, I did not initially use aligned memory allocation.
- However, aligned allocation is preferable when SIMD is used.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
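
As an illustration of the aligned allocation mentioned in the commit above, a buffer for SIMD loads and stores could be obtained along these lines. The PR does not say which allocator it uses, so this sketch assumes a C++17 toolchain that provides `std::aligned_alloc`; the helper name `alloc_aligned_floats` and the default 64-byte alignment are likewise assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: allocate storage for `count` floats whose start
// address is a multiple of `alignment` bytes.
// std::aligned_alloc expects the size to be a multiple of the alignment,
// so the byte count is rounded up. Release the buffer with std::free.
inline float *alloc_aligned_floats(std::size_t count,
                                   std::size_t alignment = 64) {
  // alignment must be a non-zero power of two
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  const std::size_t bytes = count * sizeof(float);
  const std::size_t padded = (bytes + alignment - 1) / alignment * alignment;
  return static_cast<float *>(std::aligned_alloc(alignment, padded));
}
```

The caller should check the result for `nullptr` before use and free it with `std::free`.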

taos-ci (Collaborator) left a comment

@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

SeoHyungjun (Member) left a comment

LGTM

@jijoongmoon jijoongmoon merged commit 812fcf0 into nnstreamer:main Jun 28, 2024
43 checks passed
@skykongkong8 skykongkong8 deleted the pr/hgemm/k1case branch July 1, 2024 05:10