Merge remote-tracking branch 'upstream/master'
trholding committed Jul 30, 2023
2 parents e5c1757 + f61807d commit 3235440
Showing 11 changed files with 595 additions and 102 deletions.
96 changes: 96 additions & 0 deletions .github/workflows/build.yml
@@ -0,0 +1,96 @@
name: Continuous Integration

on:
  push:
    branches:
      - master
    paths: ['.github/workflows/**', '**/Makefile', '**/*.c', '**/*.h']
  pull_request:
    types: [opened, synchronize, reopened]
    paths: ['**/Makefile', '**/*.c', '**/*.h']

env:
  BRANCH_NAME: ${{ github.head_ref || github.ref_name }}

jobs:
  # check basic builds to avoid breaking changes
  ubuntu-focal-make:
    runs-on: ubuntu-20.04

    steps:
      - name: Clone
        id: checkout
        uses: actions/checkout@v3

      - name: Dependencies
        id: depends
        run: |
          sudo apt-get update
          sudo apt-get install build-essential -y

      - name: Build
        id: make_build
        run: |
          make

      - name: Build runfast
        id: make_build_runfast
        run: |
          make runfast

  macOS-latest-make:
    runs-on: macos-latest

    steps:
      - name: Clone
        id: checkout
        uses: actions/checkout@v3

      - name: Dependencies
        id: depends
        continue-on-error: true
        run: |
          brew update

      - name: Build
        id: make_build
        run: |
          make

      - name: Build runfast
        id: make_build_runfast
        run: |
          make runfast

      - name: Build clang
        id: make_build_clang
        run: |
          make run CC=clang

  windows-latest-make:
    runs-on: windows-latest

    strategy:
      matrix:
        arch:
          - amd64
          - amd64_x86
          - amd64_arm64

    steps:
      - name: Clone
        id: checkout
        uses: actions/checkout@v3

      - name: Setup MSBuild
        uses: microsoft/setup-msbuild@v1

      - name: Setup MSVC ${{ matrix.arch }}
        uses: ilammy/msvc-dev-cmd@v1
        with:
          arch: ${{ matrix.arch }}

      - name: Build ${{ matrix.arch }}
        id: build_msvc
        run: |
          .\build_msvc.bat
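
For reference, the Ubuntu job above boils down to the following shell commands, shown here as a sketch for reproducing the build locally (assumes a Debian/Ubuntu machine):

```bash
# mirror the ubuntu-focal-make job: install a C toolchain, then build
sudo apt-get update
sudo apt-get install build-essential -y
make          # default build
make runfast  # -Ofast build
```
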
13 changes: 13 additions & 0 deletions Makefile
@@ -32,6 +32,19 @@ runfast: run.c
runomp: run.c
	$(CC) -Ofast -fopenmp -march=native run.c -lm -o run

.PHONY: win64
win64:
	x86_64-w64-mingw32-gcc-win32 -Ofast -D_WIN32 -o run.exe -I. run.c win.c

# compiles with gnu11 standard flags for amazon linux, coreos, etc. compatibility
.PHONY: rungnu
rungnu:
	$(CC) -Ofast -std=gnu11 -o run run.c -lm

.PHONY: runompgnu
runompgnu:
	$(CC) -Ofast -fopenmp -std=gnu11 run.c -lm -o run

.PHONY: clean
clean:
	rm -f run
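
For reference, the new targets are invoked like any other make target; a usage sketch (the `win64` target assumes the mingw-w64 cross-compiler is installed):

```bash
make win64      # cross-compile run.exe via x86_64-w64-mingw32-gcc-win32
make rungnu     # -Ofast with -std=gnu11 for older glibc environments
make runompgnu  # same, plus OpenMP
```
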
77 changes: 50 additions & 27 deletions README.md
@@ -35,7 +35,9 @@ Download the prebuilt run.com binary from releases

## llama2.c

<p align="center">
<img src="assets/llama_cute.jpg" width="300" height="300" alt="Cute Llama">
</p>

With the code in this repo you can train the Llama 2 LLM architecture from scratch in PyTorch, then export the weights to a binary file, and load that into one ~simple 500-line C file ([run.c](run.c)) that inferences the model. Alternatively, you can load, finetune, and inference Meta's Llama 2 (but this is still being actively fleshed out). Hence, this repo is a "fullstack" train + inference solution for Llama 2 LLM, with a focus on minimalism and simplicity. You might think that you need many billion parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough. I recommend looking at the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) paper for inspiration.

@@ -67,11 +69,20 @@ This still runs at interactive rates and samples more coherent and diverse stories

> Once upon a time, there was a little girl named Lily. She loved playing with her toys on top of her bed. One day, she decided to have a tea party with her stuffed animals. She poured some tea into a tiny teapot and put it on top of the teapot. Suddenly, her little brother Max came into the room and wanted to join the tea party too. Lily didn't want to share her tea and she told Max to go away. Max started to cry and Lily felt bad. She decided to yield her tea party to Max and they both shared the teapot. But then, something unexpected happened. The teapot started to shake and wiggle. Lily and Max were scared and didn't know what to do. Suddenly, the teapot started to fly towards the ceiling and landed on the top of the bed. Lily and Max were amazed and they hugged each other. They realized that sharing was much more fun than being selfish. From that day on, they always shared their tea parties and toys.

You can also prompt the model with a prefix (sadly, because this is currently done via positional arguments, you also have to specify temperature 1.0 and 256 steps before you enter the prompt):

```bash
./run stories42M.bin 1.0 256 "One day, Lily met a Shoggoth"
```

> One day, Lily met a Shoggoth. He was very shy, but was also very generous. Lily said “Hello Shoggy! Can I be your friend?” Shoggy was happy to have a friend and said “Yes, let’s explore the universe together!” So they set off on a journey to explore the universe. As they travelled, Shoggy was happy to explain to Lily about all the wonderful things in the universe. At the end of the day, Lily and Shoggy had gathered lots of wonderful things from the universe, and they both felt very proud. They promised to explore the universe as one big pair and to never stop being generous to each other.

There is also an even better 110M param model available, see [models](#models).

## Meta's Llama 2 models

As the neural net architecture is identical, we can also inference the Llama 2 models released by Meta. Sadly there is a bit of friction here due to licensing (I can't directly upload the checkpoints, I think). So Step 1, get the Llama 2 checkpoints by following the [Meta instructions](https://github.com/facebookresearch/llama). Once we have those checkpoints, we have to convert them into the llama2.c format.
For this we need to install the python dependencies (`pip install -r requirements.txt`) and then use the `export_meta_llama_bin.py` file, e.g. for the 7B model:

```bash
python export_meta_llama_bin.py path/to/llama/model/7B llama2_7b.bin
```
@@ -83,7 +94,7 @@ The export will take ~10 minutes or so and generate a 26GB file (the weights of
```bash
./run llama2_7b.bin
```

This ran at about 4 tokens/s compiled with [OpenMP](#openmp) on 96 threads on my CPU Linux box in the cloud. (On my MacBook Air M1, currently it's closer to 30 seconds per token if you just build with `make runfast`.) Example output:

> The purpose of this document is to highlight the state-of-the-art of CoO generation technologies, both recent developments and those in commercial use. The focus is on the technologies with the highest merit to become the dominating processes of the future and therefore to be technologies of interest to S&T ... R&D. As such, CoO generation technologies developed in Russia, Japan and Europe are described in some depth. The document starts with an introduction to cobalt oxides as complex products and a short view on cobalt as an essential material. The document continues with the discussion of the available CoO generation processes with respect to energy and capital consumption as well as to environmental damage.
@@ -152,41 +163,37 @@ $ pytest

## performance

There are many ways to potentially speed up this code depending on your system. Have a look at the [Makefile](Makefile), which contains a lot of notes. The `make run` command currently uses the `-O3` optimization by default, i.e.:

```bash
gcc -O3 -o run run.c -lm
```

-O3 includes optimizations that are expensive in terms of compile time and memory usage, including vectorization, loop unrolling, and branch prediction.

To get much better performance, try compiling with `make runfast`. This turns on the `-Ofast` flag, which includes additional optimizations that may break compliance with the C/IEEE specifications, in addition to `-O3`. See [the GCC docs](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) for more information.

Try `-march=native` to compile the program to use the architecture of the machine you're compiling on rather than a more generic CPU. This may enable additional optimizations and hardware-specific tuning such as improved vector instructions/width.

The fastest throughput I saw on my MacBook Air (M1) so far is with `make runfast`.

You can also experiment with replacing `gcc` with `clang`.
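
For example, a sketch combining the flags discussed above (assumes both compilers are installed; drop `-march=native` if your toolchain rejects it):

```bash
gcc -Ofast -march=native -o run run.c -lm    # aggressive opts + native tuning
clang -Ofast -march=native -o run run.c -lm  # same flags with clang
```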

### OpenMP

Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors. You'll need to install the OpenMP library and the clang compiler first (e.g. `apt install clang libomp-dev` on Ubuntu); note that I was not able to get improvements from OpenMP on my MacBook. You can then compile with `make runomp`, which does:

```bash
clang -Ofast -fopenmp -march=native run.c -lm -o run
```

When you run inference make sure to use OpenMP flags to set the number of threads, e.g.:

```bash
OMP_NUM_THREADS=4 ./run out/model.bin
```

Depending on your system resources you may want to tweak these hyperparameters and use more threads. But more is not always better: throughput is usually a bit U-shaped in the number of threads, as shown in the sketch below.
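
As a concrete way to find that sweet spot, here is a minimal sweep sketch (hypothetical model path; adjust the thread counts to your core count):

```bash
make runomp
for t in 1 2 4 8 16; do
  echo "OMP_NUM_THREADS=$t"
  OMP_NUM_THREADS=$t ./run out/model.bin
done
```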

## binary portability (even more magic)

@@ -256,16 +263,11 @@ $ ./run.com model.bin

## platforms

On **Windows**, use `build_msvc.bat` in a Visual Studio Command Prompt to build with MSVC, or use `make win64` to build the Windows target with the MinGW compiler toolchain from Linux or Windows. The MSVC build automatically uses OpenMP and the maximum number of threads appropriate for your CPU unless you set the `OMP_NUM_THREADS` environment variable.

On **CentOS 7** and **Amazon Linux 2018**, use the `rungnu` Makefile target: `make rungnu`, or `make runompgnu` to use OpenMP.

## ack

@@ -296,6 +298,27 @@ If your candidate PRs have elements of these it doesn't mean they won't get merged
## notable forks

- [llama2.rs](https://github.com/gaxler/llama2.rs) by @gaxler: a Rust port of this project
- [go-llama2](https://github.com/tmc/go-llama2) by @tmc: a Go port of this project
- [llama2.go](https://github.com/nikolaydubina/llama2.go) by @nikolaydubina: a Go port of this project
- [llama2.go](https://github.com/haormj/llama2.go) by @haormj: a Go port of this project
- [llama2.go](https://github.com/saracen/llama2.go) by @saracen: a Go port of this project
- [llama2.c-android](https://github.com/Manuel030/llama2.c-android) by @Manuel030: adds Android binaries of this project
- [llama2.cpp](https://github.com/leloykun/llama2.cpp) by @leloykun: a C++ port of this project
- [llama2.js](https://github.com/epicure/llama2.js) by @epicure: a JavaScript port of this project

## unsorted todos

- support Llama 2 7B Chat model and tune run.c to Chat UI/UX
- speed up 7B Llama 2 models sufficiently to work at interactive rates on Apple Silicon MacBooks
- possibly include emscripten / web backend (as seen in @gg PR)
- currently the project only runs in fp32, how easy would it be to support different precisions?
- look into quantization and what would be involved
- todo multiquery support? doesn't seem as useful for smaller models that run on CPU (?)
- todo support inferencing beyond max_seq_len steps, have to think through the kv cache
- why is MFU so low (~10%) on my A100 40GB for training?
- weird errors with torch.compile and wandb when using DDP
- (LoRA) finetuning of Llama 2 models
- make more better tests to decrease yolo

## License

1 change: 1 addition & 0 deletions build_msvc.bat
@@ -0,0 +1 @@
cl.exe /Ox /openmp /I. run.c win.c
10 changes: 4 additions & 6 deletions export_meta_llama_bin.py
@@ -1,6 +1,7 @@
"""
This script exports the Llama 2 weights in llama2c.bin format.
"""
import os
import sys
import struct
from pathlib import Path
@@ -89,16 +90,13 @@ def concat_weights(models):


def load_and_export(model_path, output_path):
    params_path = os.path.join(model_path, 'params.json')
    with open(params_path) as f:
        params = json.load(f)
    print(params)

    model_paths = sorted(list(Path(model_path).glob('consolidated.*.pth')))
    models = [torch.load(p, map_location='cpu') for p in model_paths]
    state_dict = concat_weights(models)
    del models
    export(params, state_dict, output_path)