-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ hgemm ] Consider K=1 changes #2654
Conversation
📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2654. Please a submit 1commit/1PR (one commit per one PR) policy to get comments quickly from reviewers. Your PR must pass all verificiation processes of cibot before starting a review process from reviewers. If you are new member to join this project, please read manuals in documentation folder and wiki page. In order to monitor a progress status of your PR in more detail, visit http://ci.nnstreamer.ai/. |
1ecd0d7
to
a2d0536
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
9651619
to
86fd36d
Compare
- Current implementation is rooted on general cases, thus optimize only w.r.t. K accumulation. - However, when it comes to M,1 x 1,N computation, all optimizations like packing, transposing is no use. - Implementing a explicit kernel function for such case resolved the latency issue. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- To cover transpose cases like, (1,M).T * (1,N) and all other transpose combinations, transpose with SIMD, and apply the original kernel **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
…8_case - Since K1 GEMM does not use data packing, I did not use aligned memory allocation. - However, for SIMD situation, using such is more preferred. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
86fd36d
to
63817f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Since this is K=1 case, we do not need to partial-accumulate w.r.t. K-direction with fp32, thereby accelerated approximately 200%
Self evaluation: