[fix] horovod example and unit test
jq authored and rhdong committed Jun 29, 2024
1 parent 1e5aed9 commit a8642aa
Showing 13 changed files with 211 additions and 118 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -22,4 +22,5 @@ bazel-genfiles
/pip-wheel-metadata/

.bazelrc

model_dir/
export_dir/
@@ -6,10 +6,18 @@
- Enable GPU support with `python3 -m pip install tensorflow[and-cuda]`.
- Install Horovod with `HOROVOD_WITH_MPI=1 HOROVOD_WITH_GLOO=1 pip install --no-cache-dir horovod`.
- The NVIDIA Docker image `nvcr.io/nvidia/tensorflow:24.02-tf2-py3` is recommended.
- Run `rm -rf model_dir/ export_dir/` to clean up the model and export directories before running the script.
## start train:
By default, this script starts a training task with N workers, where N is the number of GPUs on the local machine.

```shell
sh start.sh
```
To run a debug task with `steps_per_epoch` set to 1:
```shell
sh start.sh 1
```
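
The launch behaviour described above (one worker per local GPU, plus an optional step count for debugging) could be sketched roughly as follows. This is an illustration only, not the contents of the repository's `start.sh`: the script name `train.py`, the `--steps_per_epoch` flag, and the helper functions `count_gpus` and `build_cmd` are all assumptions. The sketch prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical sketch; the repository's real start.sh may differ.

# Count local GPUs with `nvidia-smi -L` (one "GPU n: ..." line per device).
# Fall back to a single worker if nvidia-smi is missing or reports no GPUs.
count_gpus() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    n=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true)
  else
    n=0
  fi
  [ "$n" -gt 0 ] || n=1
  echo "$n"
}

# Build a launch command with one Horovod process per worker.
# $1: number of workers; $2 (optional): steps per epoch for a debug run.
build_cmd() {
  workers=$1
  steps=${2:-}
  cmd="horovodrun -np $workers python3 train.py"  # train.py is a placeholder name
  if [ -n "$steps" ]; then
    cmd="$cmd --steps_per_epoch=$steps"           # hypothetical flag
  fi
  echo "$cmd"
}

# Dry run: print the command rather than executing it.
build_cmd "$(count_gpus)" "${1:-}"
```

Printing the command first makes it easy to confirm the worker count before committing GPU resources; replacing the final `build_cmd` call with `eval "$(build_cmd ...)"` would actually launch it.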
## start export for serving:
```shell
sh test.sh export
sh test.sh inference
```