
Commit

[refactor] move hf model download logic into separate python file; implement auto-download of safetensors in preference to pickles.
guocuimi committed Dec 4, 2023
1 parent bb329ec commit 8d30cfe
Showing 7 changed files with 105 additions and 56 deletions.
5 changes: 3 additions & 2 deletions Dockerfile
@@ -45,7 +45,8 @@ RUN cmake --build build --target scalellm --config Release -j$(nproc)

# install
RUN cmake --install build --prefix /app
RUN cp ./entrypoint.sh /app/entrypoint.sh
RUN cp ./scripts/download_hf_models.py /app/download_hf_models.py
RUN cp ./scripts/entrypoint.sh /app/entrypoint.sh
RUN cp ./requirements.txt /app/requirements.txt

# ---- Production ----
@@ -73,6 +74,6 @@ EXPOSE 8888
EXPOSE 9999

# start the server
ENTRYPOINT [ "./entrypoint.sh" ]
ENTRYPOINT [ "/app/entrypoint.sh" ]


27 changes: 15 additions & 12 deletions README.md
@@ -79,10 +79,15 @@ The easiest way to get started with our project is by using the official Docker

### Docker Installation

You can download and install Docker from the official website: [Docker Installation](https://docs.docker.com/get-docker/).
You can download and install Docker from the official website: [Docker Installation](https://docs.docker.com/get-docker/). To use GPUs in Docker, you also need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).

> **Note**<br />
> To use GPUs, you also need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
Here are all available docker images:
| Docker Image | CUDA 12.1 | CUDA 11.8 |
| :--------------: | :-------: | :-------: |
| [scalellm](https://hub.docker.com/r/vectorchai/scalellm/tags) | Yes | No |
| [scalellm_cu118](https://hub.docker.com/r/vectorchai/scalellm_cu118/tags) | No | Yes |
| [scalellm-gateway](https://hub.docker.com/r/vectorchai/scalellm-gateway/tags) | - | - |
| [chatbot-ui](https://hub.docker.com/r/vectorchai/chatbot-ui/tags) | - | - |

### ScaleLLM server

@@ -96,21 +101,19 @@ docker run -it --gpus=all --net=host --shm-size=1g \
docker.io/vectorchai/scalellm:latest --logtostderr
```

> **Warning**<br />
> * The docker image with tag '[latest](https://hub.docker.com/r/vectorchai/scalellm/tags)' could be changed to a new version upon new release. I don't have an efficient method to automatically repull the latest image upon new release. You'll need to manually manage the image version. All the available images can be found [here](https://hub.docker.com/r/vectorchai/scalellm/tags?page=1&ordering=last_updated).
> * The docker image with tag '[latest](https://hub.docker.com/r/vectorchai/scalellm/tags)' is built with [CUDA 12.1](https://developer.nvidia.com/cuda-12-1-0-download-archive). If you want to use [CUDA 11.8](https://developer.nvidia.com/cuda-11-8-0-download-archive), please use the image '[docker.io/vectorchai/scalellm_cu118:latest](https://hub.docker.com/r/vectorchai/scalellm_cu118)' instead.
> * NCCL might fall back to using the host memory if NVLink or PCI is not available. To allow NCCL to use the host memory, we added '--shm-size=1g' to the docker run command.
This command starts the Docker container with GPU support and various configuration options.

- `HF_MODEL_ID` specifies which Hugging Face model you want to run.
- `HF_MODEL_REVISION` specifies which Hugging Face model revision you want to run. By default, it is set to `"main"`.
- `HF_MODEL_ALLOW_PATTERN` specifies which types of files are allowed to be downloaded. By default, it is set to `"*.json,*.safetensors,*.model"`.
- `DEVICE` specifies the device on which this model should run. By default, it is set to `"auto"`.
- `DEVICE` specifies the device on which this model should run. By default, it is set to `"auto"`, which uses all available GPUs. You can also pin it to a specific GPU with `"cuda:0"` or run on the CPU with `"cpu"`.
- `HF_MODEL_ALLOW_PATTERN` specifies which types of files are allowed to be downloaded. By default, it is configured automatically based on the tensor type. Only set this option if the default configuration does not work for you.
- `HUGGING_FACE_HUB_TOKEN` specifies the token from [huggingface](https://huggingface.co/settings/tokens) for gated models.
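
As a rough illustration of how these variables fit together, here is a sketch of a `docker run` invocation. The model id, token placeholder, and host cache path are illustrative choices, not values mandated by the project; the `/models` mount point matches the default `HF_MODEL_CACHE_DIR` set in `entrypoint.sh`.

```bash
# Hypothetical example: serve a Hugging Face model with ScaleLLM in Docker.
# The model id, token, and host cache directory are placeholders.
docker run -it --gpus=all --net=host --shm-size=1g \
  -e HF_MODEL_ID="meta-llama/Llama-2-7b-hf" \
  -e HF_MODEL_REVISION="main" \
  -e DEVICE="auto" \
  -e HUGGING_FACE_HUB_TOKEN="<your-hf-token>" \
  -v $HOME/.cache/huggingface/hub:/models \
  docker.io/vectorchai/scalellm:latest --logtostderr
```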

> **Note**<br />
> Although ScaleLLM supports both `CPU` and `GPU`, we recommend using GPU for better performance. CPU support is mainly for debugging and testing purposes, so the performance might be sub-optimal. If you want to use CPU, please set `DEVICE=cpu` in the command.
> **Warning**<br />
> * The docker image with tag '[latest](https://hub.docker.com/r/vectorchai/scalellm/tags)' may be updated to a new version upon each release. To pick up the newest version you may need to re-pull it, or pin a specific tag to stay on a fixed version.
> * Two versions of the docker image are provided, for CUDA 12.1 and CUDA 11.8. Please choose the right image for your environment.
> * NCCL might fall back to using host memory if NVLink or PCI is not available. To allow NCCL to use host memory, we added '--shm-size=1g' to the docker run command.
> * Although ScaleLLM supports both `CPU` and `GPU`, we recommend using GPU for better performance. CPU support is mainly for debugging and testing purposes, so the performance might be sub-optimal.
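
For example, to stay on a fixed version instead of following the moving 'latest' tag, you can pull an explicitly tagged image. The `<tag>` below is a placeholder; pick an actual tag from the Docker Hub pages linked above.

```bash
# Hypothetical example: pin a specific image version instead of 'latest'.
docker pull docker.io/vectorchai/scalellm:<tag>
# CUDA 11.8 users would pull from the scalellm_cu118 repository instead:
docker pull docker.io/vectorchai/scalellm_cu118:<tag>
```
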
#### Ports and Endpoints

49 changes: 49 additions & 0 deletions scripts/download_hf_models.py
@@ -0,0 +1,49 @@
#!/usr/bin/env python3

import argparse
import os

from huggingface_hub import snapshot_download


# check if safetensors files are present in the model repo
def check_safetensors_present(model_id, revision):
    from huggingface_hub import HfApi
    # HfApi picks up the Hugging Face token from the environment, if one is set
    api = HfApi()
    files = api.list_repo_files(repo_id=model_id, revision=revision)
    for file in files:
        _, extension = os.path.splitext(file)
        if extension == '.safetensors':
            return True
    return False


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--repo_id', type=str, default=None)
    parser.add_argument('--revision', type=str, default=None)
    parser.add_argument('--allow_patterns', type=str, default=None)
    parser.add_argument('--cache_dir', type=str, default=None)
    args = parser.parse_args()

    repo_id = args.repo_id
    assert args.repo_id, "Please provide a repo_id"

    revision = args.revision if args.revision else "main"
    cache_dir = args.cache_dir if args.cache_dir else None
    allow_patterns = args.allow_patterns

    if not allow_patterns:
        # Define allowed file patterns for config, tokenizer, and model weights.
        # Download safetensors if present, otherwise fall back to pickle files.
        has_safetensors = check_safetensors_present(repo_id, revision)
        allow_patterns = "*.json,*.safetensors,*.model" if has_safetensors else "*.json,*.bin,*.pth,*.model"

    path = snapshot_download(args.repo_id,
                             revision=revision,
                             cache_dir=cache_dir,
                             allow_patterns=allow_patterns.split(","))
    # print the local snapshot path so callers can capture it
    print(path)
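
As a usage sketch, the script can also be invoked by hand outside the container. The repo id and cache directory below are placeholders; the script prints the local snapshot directory, which is why the command substitution works.

```bash
# Hypothetical example: download a model snapshot manually and capture its path.
MODEL_PATH=$(python3 scripts/download_hf_models.py \
  --repo_id meta-llama/Llama-2-7b-hf \
  --revision main \
  --cache_dir /models)
echo "model downloaded to: $MODEL_PATH"
```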



8 changes: 4 additions & 4 deletions entrypoint.sh → scripts/entrypoint.sh
@@ -1,10 +1,10 @@
#!/bin/bash

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"

DEVICE=${DEVICE:-"auto"}
# Set default values for HF_MODEL_REVISION and HF_MODEL_ALLOW_PATTERN
HF_MODEL_REVISION=${HF_MODEL_REVISION:-main}
# Define allowed file patterns for config, tokenizer, and model weights
HF_MODEL_ALLOW_PATTERN=${HF_MODEL_ALLOW_PATTERN:-"*.json,*.safetensors,*.model"}
# HF_MODEL_CACHE_DIR=${HF_MODEL_CACHE_DIR:-$HOME/.cache/huggingface_hub}
HF_MODEL_CACHE_DIR=${HF_MODEL_CACHE_DIR:-/models}

ARGS=""
@@ -13,7 +13,7 @@ ARGS=""
if [ -n "$HF_MODEL_ID" ]; then
echo "Downloading model from the Hugging Face hub for model id: "$HF_MODEL_ID" and revision: "$HF_MODEL_REVISION""

MODEL_PATH=$(python3 -c 'from huggingface_hub import snapshot_download; path = snapshot_download("'"$HF_MODEL_ID"'", revision="'"$HF_MODEL_REVISION"'", cache_dir="'"$HF_MODEL_CACHE_DIR"'", allow_patterns="'"$HF_MODEL_ALLOW_PATTERN"'".split(",")); print(path)')
MODEL_PATH=$(python3 ${SCRIPT_DIR}/download_hf_models.py --repo_id "$HF_MODEL_ID" --revision "$HF_MODEL_REVISION" --cache_dir "$HF_MODEL_CACHE_DIR" --allow_patterns "$HF_MODEL_ALLOW_PATTERN")
# return if error
if [ $? -ne 0 ]; then
echo "Error downloading model from the Hugging Face hub for model id: "$HF_MODEL_ID" and revision: "$HF_MODEL_REVISION""
22 changes: 22 additions & 0 deletions scripts/scalellm.sh
@@ -0,0 +1,22 @@
#!/bin/bash

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"

# Construct the arguments to pass to the 'scalellm' command
ARGS=""

# Check if HF_MODEL_ID is defined; if so, download the model from the Hugging Face hub
if [ -n "$HF_MODEL_ID" ]; then
echo "Downloading model from the Hugging Face hub for model id: "$HF_MODEL_ID" and revision: "$HF_MODEL_REVISION""

MODEL_PATH=$(python3 ${SCRIPT_DIR}/download_hf_models.py --repo_id "$HF_MODEL_ID" --revision "$HF_MODEL_REVISION" --cache_dir "$HF_MODEL_CACHE_DIR" --allow_patterns "$HF_MODEL_ALLOW_PATTERN")
# return if error
if [ $? -ne 0 ]; then
echo "Error downloading model from the Hugging Face hub for model id: "$HF_MODEL_ID" and revision: "$HF_MODEL_REVISION""
exit 1
fi
ARGS+=" --model_path "$MODEL_PATH" --model_id "$HF_MODEL_ID""
fi

# Run the 'scalellm' with the specified arguments
$SCRIPT_DIR/../build/src/server/scalellm $ARGS "$@"
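
A minimal sketch of how this script might be driven locally, assuming the server has already been built under `build/src/server`. The model id and cache directory are placeholders; the environment variables map directly onto the flags the script passes to `download_hf_models.py`.

```bash
# Hypothetical example: run the locally built server against a downloaded HF model.
HF_MODEL_ID="meta-llama/Llama-2-7b-hf" \
HF_MODEL_REVISION="main" \
HF_MODEL_CACHE_DIR="$HOME/.cache/huggingface/hub" \
./scripts/scalellm.sh --logtostderr
```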
26 changes: 0 additions & 26 deletions scripts/start_scalellm.sh

This file was deleted.

24 changes: 12 additions & 12 deletions src/request/sequence.cpp
@@ -38,19 +38,8 @@ bool Sequence::append_new_token_id(int32_t next_token_id) {
  if (is_finished_) {
    return false;
  }
  // check against stopping criteria
  const size_t generated_tokens = token_ids_.size() - num_prompt_tokens_;
  const size_t max_new_tokens = stopping_criteria_->max_tokens;
  if (max_new_tokens > 0 && (generated_tokens + 1) >= max_new_tokens) {
    // add the last token, then mark the sequence as finished
    cache_pos_ = token_ids_.size();
    token_ids_.push_back(next_token_id);

    finish_reason_ = FinishReason::LENGTH;
    is_finished_ = true;
    return false;
  }

  // check eos and stop token ids first
  if (!stopping_criteria_->ignore_eos_token &&
      next_token_id == stopping_criteria_->eos_token_id) {
    finish_reason_ = FinishReason::STOP;
@@ -67,6 +56,17 @@ bool Sequence::append_new_token_id(int32_t next_token_id) {
  // all tokens before pos should be processed and cached.
  cache_pos_ = token_ids_.size();
  token_ids_.push_back(next_token_id);

  // check against max tokens
  const size_t generated_tokens = token_ids_.size() - num_prompt_tokens_;
  const size_t max_new_tokens = stopping_criteria_->max_tokens;
  if (max_new_tokens > 0 && generated_tokens >= max_new_tokens) {
    finish_reason_ = FinishReason::LENGTH;
    is_finished_ = true;
    return false;
  }

  // return true if the sequence is not finished
  return true;
}

