
Support mixed precision training #2455

Open
wants to merge 8 commits into main
Conversation

@jihochu (Contributor) commented on Feb 2, 2024

This PR adds a loss scale factor for removing invalid data during training.
The factor is calculated dynamically during the gradient clipping step, and it is initially disabled until
the loss scale property is set.
The fc/pooling/conv2d/softmax layers are modified to support the loss scale and mixed tensor types.

Signed-off-by: Jiho Chu [email protected]
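
The dynamic behaviour described above can be pictured with a small, self-contained sketch. This is not the nntrainer implementation; the LossScaler class, its method names, and the backoff/growth constants are illustrative assumptions about how a loss scale factor is typically adjusted around the gradient clipping step.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative dynamic loss scaler: shrink the scale and skip the step when
// unscaled gradients are not finite; grow it again after a run of good steps.
class LossScaler {
public:
  explicit LossScaler(float init_scale = 65536.0f) : scale(init_scale) {}

  // Unscale the gradients in place; return false if the step must be skipped.
  bool unscale_and_check(std::vector<float> &grads) {
    for (auto &g : grads) {
      g /= scale;
      if (!std::isfinite(g)) {                // fp16 overflow produced inf/nan
        scale = std::max(scale * 0.5f, 1.0f); // back off the scale
        good_steps = 0;
        return false;                         // caller skips this update
      }
    }
    if (++good_steps >= growth_interval) {    // stable: try a larger scale
      scale *= 2.0f;
      good_steps = 0;
    }
    return true;
  }

  float get_scale() const { return scale; }

private:
  float scale;
  int good_steps = 0;
  static constexpr int growth_interval = 2000;
};
```

On overflow the whole optimizer step is skipped, which is what "removing invalid data during training" amounts to in practice.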

@taos-ci (Collaborator) commented on Feb 2, 2024

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2455. Please follow the 1 commit/1 PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before a review process from reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and wiki page. In order to monitor the progress status of your PR in more detail, visit http://ci.nnstreamer.ai/.

@taos-ci (Collaborator) commented on Feb 2, 2024

:octocat: cibot: @jihochu, nntrainer/layers/pooling2d_layer.cpp does not include Doxygen tags such as @file @brief @author @bug. You must include the Doxygen tags in the source code. Please refer to a Doxygen manual at http://github.com/nnstreamer/TAOS-CI/blob/main/ci/doc/doxygen-documentation.md

@taos-ci (Collaborator) commented on Feb 2, 2024

:octocat: cibot: @jihochu, A builder check could not be completed because one of the checkers did not finish. To find out the reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2455-202402021547410.65952706336975-af2e8829e8e0ac70333370e438e9b7b37bc604f2/.

@taos-ci (Collaborator) commented on Feb 2, 2024

:octocat: cibot: @jihochu, A builder check could not be completed because one of the checkers did not finish. To find out the reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2455-202402021637130.47981810569763-5ce56ff64b70e29561125de65169bff8ee06a41d/.

@taos-ci (Collaborator) left a comment

@jihochu, 💯 All CI checkers are successfully verified. Thanks.

It adds the loss scale property as a common model property.

Signed-off-by: Jiho Chu <[email protected]>
It validates the derivatives after backwarding, and applies the gradient if
the derivative validation succeeds.

Signed-off-by: Jiho Chu <[email protected]>
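A minimal sketch of what "apply the gradient only if validation succeeds" can look like; the Weight struct, is_valid, and apply_if_valid below are illustrative stand-ins, not the nntrainer interfaces.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative weight with its gradient; not the nntrainer Weight class.
struct Weight {
  std::vector<float> value;
  std::vector<float> gradient;
};

// True only if every element of the gradient is finite (no NaN/Inf).
static bool is_valid(const std::vector<float> &grad) {
  return std::all_of(grad.begin(), grad.end(),
                     [](float x) { return std::isfinite(x); });
}

// Apply a plain SGD update only when all gradients passed validation;
// otherwise skip the whole step for this iteration.
bool apply_if_valid(std::vector<Weight> &weights, float lr) {
  for (const auto &w : weights)
    if (!is_valid(w.gradient))
      return false;

  for (auto &w : weights)
    for (std::size_t i = 0; i < w.value.size(); ++i)
      w.value[i] -= lr * w.gradient[i];
  return true;
}
```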
A clone method with a tensor type is added for creating a tensor with a different
data type. Some convenience methods for the loss scale are also added.

Signed-off-by: Jiho Chu <[email protected]>
It adds tests for conv2d in fp16.

Signed-off-by: Jiho Chu <[email protected]>
It fixes doxygen comments reported by the clang-format checker.

Signed-off-by: Jiho Chu <[email protected]>
It installs the loss_layer header file for custom loss layers.

Signed-off-by: Jiho Chu <[email protected]>
It is assumed that activations and weights are fully compatible,
so they do not need to be converted.
The input layer and loss layers are different, because input data and label
data are assumed to always be float32 for now.

Signed-off-by: Jiho Chu <[email protected]>
Either an internal tensor or a gradient may get an invalid value.
This patch checks the validity of the data and fixes it.

Also,
the sscal API is replaced with scopy for setZero, because sscal produces
an invalid value if an invalid input value is used.

Signed-off-by: Jiho Chu <[email protected]>
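The sscal-versus-scopy point is easy to demonstrate in isolation: multiplying by zero does not clear NaN/Inf, while copying zeros does. The snippet below uses plain loops in place of the actual BLAS calls, so it only illustrates the numeric behaviour the commit message describes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

int main() {
  std::vector<float> buf = {1.0f, std::numeric_limits<float>::quiet_NaN(),
                            std::numeric_limits<float>::infinity()};

  // sscal-style zeroing: x *= 0.  NaN * 0 == NaN and Inf * 0 == NaN,
  // so the invalid values survive the "setZero".
  auto scal_zeroed = buf;
  for (auto &x : scal_zeroed)
    x *= 0.0f;
  assert(std::isnan(scal_zeroed[1]) && std::isnan(scal_zeroed[2]));

  // scopy-style zeroing: copy from a zero buffer, which overwrites the
  // contents unconditionally and is safe regardless of what was there.
  std::vector<float> zeros(buf.size(), 0.0f);
  auto copy_zeroed = buf;
  std::copy(zeros.begin(), zeros.end(), copy_zeroed.begin());
  assert(copy_zeroed[1] == 0.0f && copy_zeroed[2] == 0.0f);
  return 0;
}
```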
@taos-ci (Collaborator) left a comment

@jihochu, 💯 All CI checkers are successfully verified. Thanks.

@myungjoo (Member) commented on Apr 20, 2024

Recommendation:

- Keep a PR with every related commit as a test basis and mark it "Do Not Merge" or "Draft PR".
- Make a number of independent smaller PRs ("sub-PRs of this PR") so that reviewers can actually read and understand them, removing them from the "full PR" whenever they are merged. It is recommended to start with PRs containing the common data structures and interfaces, without implementation.

if (num_w_opt_m > 0)
  run_context->getWeightOptMasterVar(i, j).read(file);
else
  run_context->getWeightOptVar(i, j).read(file);
A collaborator commented:

This needs to be reversed. The base model data needs to be saved in FP16, not FP32. We could read FP16 and save it to the master weight.
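
To make the suggested direction concrete, here is a hedged sketch, in generic C++ rather than the nntrainer save/load code, of reading fp16 weights from the model file and widening them into an fp32 master buffer at load time; fp16_to_fp32 and read_fp16_as_master are made-up names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <vector>

// Naive IEEE half -> float widening (subnormals flushed to zero), used only
// for illustration; a real implementation would use a proper fp16 type.
static float fp16_to_fp32(uint16_t h) {
  const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  const uint32_t exp  = (h >> 10) & 0x1Fu;
  const uint32_t mant = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0)
    bits = sign;                                       // zero / subnormal -> 0
  else if (exp == 31)
    bits = sign | 0x7F800000u | (mant << 13);          // Inf / NaN
  else
    bits = sign | ((exp + 112u) << 23) | (mant << 13); // rebias 15 -> 127
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

// Read `count` fp16 values stored in the model file and widen them into an
// fp32 master buffer: the saved model stays fp16 while the master copy is
// reconstructed at load time, as the review comment suggests.
std::vector<float> read_fp16_as_master(std::ifstream &file, std::size_t count) {
  std::vector<uint16_t> raw(count);
  file.read(reinterpret_cast<char *>(raw.data()),
            static_cast<std::streamsize>(count * sizeof(uint16_t)));
  std::vector<float> master(count);
  for (std::size_t i = 0; i < count; ++i)
    master[i] = fp16_to_fp32(raw[i]);
  return master;
}
```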

weight_tensor_type);
TensorDim hidden_state_dim(batch_size, 1, max_timestep, unit,
weight_tensor_type);
hidden_state_dim.setDataType(context.getActivationDataType());
A collaborator commented:

This could be just TensorDim hidden_state_dim(batch_size, 1, max_timestep, unit, context.getActivationDataType()).

void Manager::deallocateWeights() { weight_pool.deallocate(); }
void Manager::deallocateWeights() {
weight_pool.deallocate();
weight_master_pool.deallocate();
A collaborator commented:

I don't think we need a separate pool for the weight master.

dim_a.setDataType(act_type);
var = weight_pool.requestOrExtend(shared_name, dim_a, var_exec_order,
var_ls, t_initializer);
var_m = weight_master_pool.requestOrExtend(
A collaborator commented:

I think the tensor pool can manage it if we just request from weight_pool.
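
A purely hypothetical sketch (not the nntrainer TensorPool API) of what "one pool manages both" could mean: the pool owns the low-precision compute tensor and its fp32 master copy under a single name, so no separate weight_master_pool is needed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical pooled weight: both precisions live in one entry.
struct PooledWeight {
  std::vector<uint16_t> compute; // fp16 bits used in forward/backward
  std::vector<float> master;     // fp32 copy updated by the optimizer
};

class WeightPool {
public:
  // Creates the entry on first request; later requests reuse the same entry.
  PooledWeight &request(const std::string &name, std::size_t size) {
    auto &entry = pool[name];
    entry.compute.resize(size);
    entry.master.resize(size);
    return entry;
  }

private:
  std::unordered_map<std::string, PooledWeight> pool;
};
```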

dim_a.setDataType(act_type);
var = weight_pool.request(name, dim_a, var_exec_order, var_ls,
t_initializer);
var_m = weight_master_pool.request(name, dim, var_exec_order, var_ls,
A collaborator commented:

The execution order of var_m should be applyGradient_order only.

@@ -353,10 +363,15 @@ sharedConstTensors NetworkGraph::forwarding(
bool training,
std::function<void(std::shared_ptr<LayerNode>, bool)> forwarding_op,
std::function<bool(void *userdata)> stop_cb, void *userdata) {

for (auto w : clip_weights) {
A collaborator commented:

I wonder if we also have to enable the gradient clip property to use mixed precision training. I guess this PR doesn't consider the case where mixed precision and gradient clipping are both enabled.
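
For reference, a generic sketch of how the two features can coexist in one pass over the clipped weights: unscale by the loss scale first, bail out on overflow, then clip by global norm. The names and structure are assumptions, not the actual NetworkGraph code path.

```cpp
#include <cmath>
#include <vector>

using Grads = std::vector<std::vector<float>>; // one vector per weight tensor

// Returns false (skip the step) if any unscaled gradient is non-finite;
// otherwise applies standard clip-by-global-norm in place.
bool unscale_and_clip(Grads &grads, float loss_scale, float max_norm) {
  double sq_sum = 0.0;
  for (auto &g : grads)
    for (auto &x : g) {
      x /= loss_scale;          // recover the true gradient
      if (!std::isfinite(x))
        return false;           // overflow: skip this step, lower the scale
      sq_sum += static_cast<double>(x) * x;
    }

  const double norm = std::sqrt(sq_sum);
  if (norm > max_norm) {        // clip by global norm
    const float factor = static_cast<float>(max_norm / norm);
    for (auto &g : grads)
      for (auto &x : g)
        x *= factor;
  }
  return true;
}
```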

DonghakPark added a commit to DonghakPark/nntrainer that referenced this pull request May 10, 2024
This PR is to update the mixed precision layer.
- integrate nnstreamer#2568 & nnstreamer#2455
- will update more tests

**Self evaluation:**
1. Build test:	 [X]Passed [ ]Failed [ ]Skipped
2. Run test:	 [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghak PARK <[email protected]>