[shardformer] sync tests modification to sequence parallel branch #4434

Closed
Changes from all commits (67 commits)
cc3cbe9
[workflow] show test duration (#4159)
FrankLeeeee Jul 4, 2023
190a6ea
[dtensor] fixed readme file name and removed deprecated file (#4162)
FrankLeeeee Jul 4, 2023
fee32a3
[docker] added ssh and rdma support for docker (#4192)
FrankLeeeee Jul 7, 2023
5891344
[checkpointio] Unsharded Optimizer Checkpoint for Gemini …
Fridge003 Jul 7, 2023
c1cf752
[docker] fixed ninja build command (#4203)
FrankLeeeee Jul 10, 2023
4e9b09c
Automated submodule synchronization (#4217)
github-actions[bot] Jul 12, 2023
9a4842c
revise shardformer readme (#4246)
CjhHa1 Jul 17, 2023
7ff11b5
[example] add llama pretraining (#4257)
binmakeswell Jul 17, 2023
4b97754
[Kernels] added triton-implemented of self attention for colossal-ai …
tiandiao123 Jul 18, 2023
fc5cef2
[lazy] support init on cuda (#4269)
ver217 Jul 19, 2023
c6f6005
[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302)
Fridge003 Jul 21, 2023
02192a6
[ci] support testmon core pkg change detection (#4305)
ver217 Jul 21, 2023
b366f1d
[NFC] Fix format for mixed precision (#4253)
CjhHa1 Jul 17, 2023
86cf6ae
Fix/format (#4261)
MichelleMa8 Jul 18, 2023
915ed8b
[NFC] polish applications/Chat/inference/requirements.txt code style …
Camille7777 Jul 18, 2023
77c469e
[NFC] polish applications/Chat/coati/models/base/actor.py code style …
jason524w Jul 18, 2023
dee1c96
[NFC] policy applications/Chat/examples/ray/mmmt_prompt.py code style…
CZYCW Jul 18, 2023
85774f0
[NFC] polish colossalai/cli/benchmark/utils.py code style (#4254)
yuanheng-zhao Jul 18, 2023
c614a99
[NFC] polish colossalai/auto_parallel/offload/amp_optimizer.py code s…
Yanjia0 Jul 18, 2023
abe4f97
[NFC] polish colossalai/booster/plugin/low_level_zero_plugin.py code …
supercooledith Jul 18, 2023
b2debdc
[NFC] polish applications/Chat/coati/dataset/sft_dataset.py code styl…
zhengzangw Jul 18, 2023
798cb72
[NFC] polish applications/Chat/coati/trainer/base.py code style (#4260)
Shenggan Jul 18, 2023
3883db4
[NFC] polish unary_elementwise_generator.py code style (#4267)
YeAnbang Jul 18, 2023
fee5532
[NFC] polish runtime_preparation_pass style (#4266)
CWHer Jul 18, 2023
a50d39a
[NFC] fix: format (#4270)
dayellow Jul 18, 2023
1ce997d
[NFC] polish applications/Chat/examples/train_reward_model.py code st…
Xu-Kai Jul 18, 2023
caa4433
[NFC] fix format of application/Chat/coati/trainer/utils.py (#4273)
kurisusnowdeng Jul 18, 2023
dc1b612
[NFC] polish applications/Chat/inference/server.py code style (#4274)
chengeharrison Jul 18, 2023
709e121
[NFC] polish applications/Chat/coati/models/generation.py code style …
yangluo7 Jul 18, 2023
c972d65
applications/Chat/.gitignore (#4279)
henryqin1997 Jul 19, 2023
9e51293
[NFC] polish applications/Chat/coati/trainer/strategies/base.py code …
ziruizhu Jul 19, 2023
0991405
[NFC] polish applications/Chat/coati/models/utils.py codestyle (#4277)
yuxuan-lou Jul 19, 2023
ef4b99e
add llama example CI
binmakeswell Jul 22, 2023
5187c96
support session-based training (#4313)
chengeharrison Jul 28, 2023
c6ab969
[zero] refactor low level zero for shard evenly (#4030)
Gy-Lu Jun 30, 2023
79cf1b5
[zero]support no_sync method for zero1 plugin (#4138)
Gy-Lu Jul 4, 2023
c668801
[zero] allow passing process group to zero12 (#4153)
Gy-Lu Jul 4, 2023
dd7cc58
[zero] add state dict for low level zero (#4179)
Gy-Lu Jul 6, 2023
1a49a5e
[zero] support shard optimizer state dict of zero (#4194)
Gy-Lu Jul 11, 2023
45b08f0
[zero] optimize the optimizer step time (#4221)
Gy-Lu Jul 18, 2023
03654c0
fix localhost measurement (#4320)
Gy-Lu Aug 1, 2023
75c5389
[chat] fix compute_approx_kl (#4338)
CWHer Aug 1, 2023
8064771
[release] update version (#4332)
ver217 Aug 1, 2023
16c0acc
[hotfix] update gradio 3.11 to 3.34.0 (#4329)
chncaption Aug 1, 2023
16bf4c0
[test] remove useless tests (#4359)
ver217 Aug 1, 2023
da4f7b8
[chat] fix bugs and add unit tests (#4213)
CWHer Aug 2, 2023
25c57b9
[fix] coloattention support flash attention 2 (#4347)
flybird11111 Aug 4, 2023
38b792a
[coloattention] fix import error (#4380)
flybird11111 Aug 4, 2023
f40b718
[doc] Fix gradient accumulation doc. (#4349)
flybird11111 Aug 4, 2023
089c365
[doc] add Series A Funding and NeurIPS news (#4377)
binmakeswell Aug 4, 2023
7c84f51
[Shardformer] Merge flash attention branch to pipeline branch (#4362)
flybird11111 Aug 7, 2023
2e77e57
[pipeline] rewrite t5 tests & support multi-tensor transmitting in pi…
Fridge003 Aug 8, 2023
458ae33
[kernel] updated unittests for coloattention (#4389)
flybird11111 Aug 9, 2023
c14920a
[shardformer] update shardformer to use flash attention 2 (#4392)
flybird11111 Aug 9, 2023
ed2c229
[shardformer] test all optimizations (#4399)
flybird11111 Aug 10, 2023
6ccecc0
[gemini] fix tensor storage cleaning in state dict collection (#4396)
Fridge003 Aug 10, 2023
9916a19
[pipeline] rewrite bert tests and fix some bugs (#4409)
CjhHa1 Aug 11, 2023
fcbf80f
[shardformer]fix, test gpt2 for AMP+TP (#4403)
flybird11111 Aug 11, 2023
d86ddd9
[hotfix] fix unsafe async comm in zero (#4404)
Gy-Lu Aug 11, 2023
1e518ae
[shardformer] rewrite tests for opt/bloom/llama/vit/chatglm (#4395)
Fridge003 Aug 11, 2023
d4a3a10
[shardformer] update tests for all optimization (#4413)
flybird11111 Aug 11, 2023
6990477
Merge branch 'main' into feature/pipeline
ver217 Aug 14, 2023
ac8d4ed
[shardformer]update t5 tests for using all optimizations. (#4407)
flybird11111 Aug 14, 2023
82ea190
[shardformer] update bloom/llama/vit/chatglm tests (#4420)
flybird11111 Aug 14, 2023
60db2cc
Merge pull request #4424 from ver217/sync/pipeline
FrankLeeeee Aug 14, 2023
9d1a6d2
[misc] resolve code factor issues (#4433)
ver217 Aug 14, 2023
2dd1b39
[sync] update tests modification to sequence parallel branch
flybird11111 Aug 14, 2023
2 changes: 1 addition & 1 deletion .github/workflows/build_on_pr.yml
@@ -208,7 +208,7 @@ jobs:

- name: Execute Unit Testing
run: |
CURL_CA_BUNDLE="" PYTHONPATH=$PWD pytest --testmon --testmon-cov=. tests/
CURL_CA_BUNDLE="" PYTHONPATH=$PWD pytest --testmon --testmon-cov=. --durations=10 tests/
env:
DATA: /data/scratch/cifar-10
NCCL_SHM_DISABLE: 1
4 changes: 2 additions & 2 deletions .github/workflows/build_on_schedule.yml
@@ -3,7 +3,7 @@ name: Build on Schedule
on:
schedule:
# run at 00:00 of every Sunday
- cron: '0 0 * * *'
- cron: "0 0 * * *"
workflow_dispatch:

jobs:
@@ -60,7 +60,7 @@ jobs:
- name: Unit Testing
if: steps.check-avai.outputs.avai == 'true'
run: |
PYTHONPATH=$PWD pytest tests
PYTHONPATH=$PWD pytest --durations=0 tests
env:
DATA: /data/scratch/cifar-10
LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2 changes: 1 addition & 1 deletion .github/workflows/compatiblity_test_on_dispatch.yml
@@ -72,7 +72,7 @@ jobs:
ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
- name: Download cub for CUDA 10.2
run: |
CUDA_VERSION=$(cat $CUDA_HOME/version.txt | grep "CUDA Version" | awk '{print $NF}' | cut -d. -f1,2)
CUDA_VERSION=$(nvcc -V | awk -F ',| ' '/release/{print $6}')

# check if it is CUDA 10.2
# download cub
2 changes: 1 addition & 1 deletion .github/workflows/compatiblity_test_on_pr.yml
@@ -66,7 +66,7 @@ jobs:
ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
- name: Download cub for CUDA 10.2
run: |
CUDA_VERSION=$(cat $CUDA_HOME/version.txt | grep "CUDA Version" | awk '{print $NF}' | cut -d. -f1,2)
CUDA_VERSION=$(nvcc -V | awk -F ',| ' '/release/{print $6}')

# check if it is CUDA 10.2
# download cub
12 changes: 12 additions & 0 deletions .github/workflows/compatiblity_test_on_schedule.yml
@@ -61,6 +61,18 @@ jobs:
with:
ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}

- name: Download cub for CUDA 10.2
run: |
CUDA_VERSION=$(nvcc -V | awk -F ',| ' '/release/{print $6}')

# check if it is CUDA 10.2
# download cub
if [ "$CUDA_VERSION" = "10.2" ]; then
wget https://github.com/NVIDIA/cub/archive/refs/tags/1.8.0.zip
unzip 1.8.0.zip
cp -r cub-1.8.0/cub/ colossalai/kernel/cuda_native/csrc/kernels/include/
fi

- name: Install Colossal-AI
run: |
pip install -v --no-cache-dir .
12 changes: 12 additions & 0 deletions .github/workflows/cuda_ext_check_before_merge.yml
@@ -37,6 +37,18 @@ jobs:
- name: Install PyTorch
run: eval ${{ matrix.build.torch_command }}

- name: Download cub for CUDA 10.2
run: |
CUDA_VERSION=$(nvcc -V | awk -F ',| ' '/release/{print $6}')

# check if it is CUDA 10.2
# download cub
if [ "$CUDA_VERSION" = "10.2" ]; then
wget https://github.com/NVIDIA/cub/archive/refs/tags/1.8.0.zip
unzip 1.8.0.zip
cp -r cub-1.8.0/cub/ colossalai/kernel/cuda_native/csrc/kernels/include/
fi

- name: Build
run: |
CUDA_EXT=1 pip install -v .
4 changes: 3 additions & 1 deletion .github/workflows/run_chatgpt_examples.yml
@@ -43,7 +43,9 @@ jobs:
run: |
cd applications/Chat
rm -rf ~/.cache/colossalai
./examples/test_ci.sh
./tests/test_inference.sh
./tests/test_benchmarks.sh
./tests/test_train.sh
env:
NCCL_SHM_DISABLE: 1
MAX_JOBS: 8
16 changes: 14 additions & 2 deletions README.md
@@ -25,14 +25,15 @@
</div>

## Latest News
* [2023/07] [HPC-AI Tech Raises 22 Million USD in Series A Funding](https://www.hpc-ai.tech/blog/hpc-ai-tech-raises-22-million-usd-in-series-a-funding-to-fuel-team-expansion-and-business-growth)
* [2023/07] [65B Model Pretraining Accelerated by 38%, Best Practices for Building LLaMA-Like Base Models Open-Source](https://www.hpc-ai.tech/blog/large-model-pretraining)
* [2023/03] [ColossalChat: An Open-Source Solution for Cloning ChatGPT With a Complete RLHF Pipeline](https://medium.com/@yangyou_berkeley/colossalchat-an-open-source-solution-for-cloning-chatgpt-with-a-complete-rlhf-pipeline-5edf08fb538b)
* [2023/03] [Intel and Colossal-AI Partner to Deliver Cost-Efficient Open-Source Solution for Protein Folding Structure Prediction](https://www.hpc-ai.tech/blog/intel-habana)
* [2023/03] [AWS and Google Fund Colossal-AI with Startup Cloud Programs](https://www.hpc-ai.tech/blog/aws-and-google-fund-colossal-ai-with-startup-cloud-programs)
* [2023/02] [Open Source Solution Replicates ChatGPT Training Process! Ready to go with only 1.6GB GPU Memory](https://www.hpc-ai.tech/blog/colossal-ai-chatgpt)
* [2023/01] [Hardware Savings Up to 46 Times for AIGC and Automatic Parallelism](https://medium.com/pytorch/latest-colossal-ai-boasts-novel-automatic-parallelism-and-offers-savings-up-to-46x-for-stable-1453b48f3f02)
* [2022/11] [Diffusion Pretraining and Hardware Fine-Tuning Can Be Almost 7X Cheaper](https://www.hpc-ai.tech/blog/diffusion-pretraining-and-hardware-fine-tuning-can-be-almost-7x-cheaper)
* [2022/10] [Use a Laptop to Analyze 90% of Proteins, With a Single-GPU Inference Sequence Exceeding 10,000](https://www.hpc-ai.tech/blog/use-a-laptop-to-analyze-90-of-proteins-with-a-single-gpu-inference-sequence-exceeding)
* [2022/09] [HPC-AI Tech Completes $6 Million Seed and Angel Round Fundraising](https://www.hpc-ai.tech/blog/hpc-ai-tech-completes-6-million-seed-and-angel-round-fundraising-led-by-bluerun-ventures-in-the)

## Table of Contents
<ul>
@@ -49,6 +50,7 @@
<li>
<a href="#Parallel-Training-Demo">Parallel Training Demo</a>
<ul>
<li><a href="#LLaMA">LLaMA</a></li>
<li><a href="#GPT-3">GPT-3</a></li>
<li><a href="#GPT-2">GPT-2</a></li>
<li><a href="#BERT">BERT</a></li>
@@ -216,6 +218,15 @@ Acceleration of [AlphaFold Protein Structure](https://alphafold.ebi.ac.uk/)

## Parallel Training Demo

### LLaMA
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/examples/images/LLaMA_pretraining.png" width=600/>
</p>

- 65-billion-parameter large model pretraining accelerated by 38%
[[code]](https://github.com/hpcaitech/ColossalAI/tree/example/llama/examples/language/llama)
[[blog]](https://www.hpc-ai.tech/blog/large-model-pretraining)

### GPT-3
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/GPT3-v5.png" width=700/>
@@ -452,6 +463,7 @@ To cite this project, you can use the following BibTeX citation.
}
```

Colossal-AI has been accepted as official tutorial by top conferences [SC](https://sc22.supercomputing.org/), [AAAI](https://aaai.org/Conferences/AAAI-23/), [PPoPP](https://ppopp23.sigplan.org/), [CVPR](https://cvpr2023.thecvf.com/), [ISC](https://www.isc-hpc.com/), etc.
Colossal-AI has been accepted as official tutorial by top conferences [NeurIPS](https://nips.cc/), [SC](https://sc22.supercomputing.org/), [AAAI](https://aaai.org/Conferences/AAAI-23/),
[PPoPP](https://ppopp23.sigplan.org/), [CVPR](https://cvpr2023.thecvf.com/), [ISC](https://www.isc-hpc.com/), [NVIDIA GTC](https://www.nvidia.com/en-us/on-demand/session/gtcspring23-S51482/) ,etc.

<p align="right">(<a href="#top">back to top</a>)</p>
2 changes: 1 addition & 1 deletion applications/Chat/.gitignore
@@ -145,4 +145,4 @@ docs/.build
# wandb log
example/wandb/

examples/awesome-chatgpt-prompts/
examples/awesome-chatgpt-prompts/
7 changes: 4 additions & 3 deletions applications/Chat/coati/dataset/__init__.py
@@ -1,9 +1,10 @@
from .prompt_dataset import PromptDataset
from .reward_dataset import HhRlhfDataset, RmStaticDataset
from .sft_dataset import DataCollatorForSupervisedDataset, SFTDataset, SupervisedDataset
from .sft_dataset import SFTDataset, SupervisedDataset
from .utils import is_rank_0

__all__ = [
'RmStaticDataset', 'HhRlhfDataset', 'is_rank_0', 'SFTDataset', 'SupervisedDataset',
'DataCollatorForSupervisedDataset', 'PromptDataset'
'RmStaticDataset', 'HhRlhfDataset',
'SFTDataset', 'SupervisedDataset',
'PromptDataset', 'is_rank_0',
]
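
Not part of the diff, but worth noting for downstream users: `DataCollatorForSupervisedDataset` is no longer re-exported from `coati.dataset`, so imports that relied on it must be updated. A minimal sketch of the public surface after this change (illustrative only, assuming `coati` is on the Python path):

```python
# These names remain importable from the package root after this change;
# DataCollatorForSupervisedDataset would now have to be imported from its
# defining module instead (if it is still needed at all).
from coati.dataset import (
    HhRlhfDataset,
    PromptDataset,
    RmStaticDataset,
    SFTDataset,
    SupervisedDataset,
    is_rank_0,
)
```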
87 changes: 87 additions & 0 deletions applications/Chat/coati/dataset/conversation.py
@@ -0,0 +1,87 @@
# Copyright 2023 lm-sys@FastChat
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import dataclasses
from enum import Enum, auto
from typing import List


class SeparatorStyle(Enum):
ADD_EOS_TOKEN = auto()


@dataclasses.dataclass
class Conversation:
system: str
roles: List[str]
messages: List[List[str]]
offset: int
sep_style: SeparatorStyle = SeparatorStyle.ADD_EOS_TOKEN
sep: str = "</s>"

skip_next: bool = False

def get_prompt(self):
if self.sep_style == SeparatorStyle.ADD_EOS_TOKEN:
ret = self.system
for role, message in self.messages:
if message:
ret += role + ": " + message + self.sep
else:
ret += role + ": "
return ret
else:
raise ValueError(f"Invalid style: {self.sep_style}")

def append_message(self, role, message):
self.messages.append([role, message])

def to_gradio_chatbot(self):
ret = []
for i, (role, msg) in enumerate(self.messages[self.offset:]):
if i % 2 == 0:
ret.append([msg, None])
else:
ret[-1][-1] = msg
return ret

def copy(self):
return Conversation(system=self.system,
roles=self.roles,
messages=[[x, y] for x, y in self.messages],
offset=self.offset,
sep_style=self.sep_style,
sep=self.sep)

def dict(self):
return {
"system": self.system,
"roles": self.roles,
"messages": self.messages,
"offset": self.offset,
"sep": self.sep
}


conv = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
roles=("Human", "Assistant"),
messages=(),
offset=0,
sep_style=SeparatorStyle.ADD_EOS_TOKEN,
sep="</s>",
)

default_conversation = conv
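
For orientation (not part of the diff): a minimal sketch of how the `Conversation` helper added above could be used to build a generation prompt. The question text and import path are illustrative, assuming `coati` is importable:

```python
from coati.dataset.conversation import default_conversation

# Work on a copy so the shared module-level default is not mutated;
# copy() also turns the empty messages tuple into a mutable list.
conv = default_conversation.copy()
conv.append_message(conv.roles[0], "What is Colossal-AI?")
conv.append_message(conv.roles[1], None)  # leave the assistant turn open

# get_prompt() closes completed turns with the "</s>" separator and ends
# with "Assistant: ", ready to be fed to the model for generation.
prompt = conv.get_prompt()
print(prompt)
```

Copying before appending matters because `default_conversation.messages` is initialized as a tuple, which does not support `append`.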
18 changes: 6 additions & 12 deletions applications/Chat/coati/dataset/prompt_dataset.py
@@ -1,20 +1,13 @@
import copy
import random
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence
from typing import Dict

import torch
import torch.distributed as dist
import transformers
from torch.utils.data import Dataset
from tqdm import tqdm

from colossalai.logging import get_dist_logger

from .utils import is_rank_0, jload

logger = get_dist_logger()
from .utils import jload


class PromptDataset(Dataset):
@@ -27,12 +20,13 @@ def __init__(self,
max_length: int = 96):
super(PromptDataset, self).__init__()
self.keyed_prompt = defaultdict(list)
logger.info("Loading data...")
self.logger = get_dist_logger()
self.logger.info("Loading data...")
list_data_dict = jload(data_path)
logger.info(f"Loaded {len(list_data_dict)} examples.")
self.logger.info(f"Loaded {len(list_data_dict)} examples.")

if max_datasets_size is not None:
logger.info(f"Limiting dataset to {max_datasets_size} examples.")
self.logger.info(f"Limiting dataset to {max_datasets_size} examples.")
list_data_dict = list_data_dict[:max_datasets_size]

instructions = [data_dict["instruction"] for data_dict in list_data_dict]