Added BLIS support, fixes, updated documentation
- Added Support for BLIS
- Fixed CBLAS Support
- Updated Documentation
trholding committed Aug 4, 2023
1 parent af5c2c7 commit 324edf8
Showing 3 changed files with 99 additions and 21 deletions.
13 changes: 12 additions & 1 deletion Makefile
@@ -1,5 +1,12 @@
# Libraries
# BLIS
BLIS_PREFIX = /usr/local
BLIS_INC = $(BLIS_PREFIX)/include/blis
BLIS_LIB = $(BLIS_PREFIX)/lib/libblis.a

# choose your compiler, e.g. gcc/clang
# example override to clang: make run CC=clang

CC = gcc

# the most basic way of building that is most likely to work on most systems
@@ -55,7 +62,11 @@ runopenblas: run.c

.PHONY: runblas
runblas: run.c
$(CC) -D OPENBLAS -Ofast -fopenmp -march=native run.c -lm -lcblas -o run
$(CC) -D CBLAS -Ofast -fopenmp -march=native run.c -lm -lcblas -o run

.PHONY: runblis
runblis: run.c
$(CC) -D BLIS -Ofast -fopenmp -march=native -I$(BLIS_INC) run.c -lm -lblis -o run

.PHONY: cosmorun
cosmorun:
92 changes: 78 additions & 14 deletions README.md
@@ -36,9 +36,20 @@ Building this with gcc or clang would result in normal binaries similar to upstr
Read more:
[How to build](https://github.com/trholding/llama2.c#portable-binary-build)

- [x] Output Token Buffer
- [x] Openblas
- [x] CLBLAST - GPU
### Performance feature support
+ CPU
- [x] OpenBLAS
- [x] CBLAS
- [x] BLIS

+ CPU/GPU
- [x] OpenMP (CPU)
- [ ] OpenACC

+ GPU
- [x] OpenCL (via CLBlast)
- [ ] Vulkan
- [ ] CUDA

Download the prebuilt run.com binary from releases

@@ -64,7 +75,7 @@ wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
You can also prompt the model with a prefix:

```bash
./run stories42M.bin 1.0 256 "A big dog"
./run stories42M.bin -t 1.0 -s 256 -p "A big dog"
```

When prompting, the temperature and steps parameters are needed since we use simple positional arguments.
@@ -139,6 +150,62 @@ Build on Linux and Windows:

The MSVC build will use openmp and max threads suitable for your CPU unless you set `OMP_NUM_THREADS` env.

### Build with acceleration

**OpenMP**

This build enables acceleration via OpenMP.

```bash
make runomp
```

+ Requires the [OpenMP](https://www.openmp.org/) runtime and a compiler with OpenMP support to be available on the system.
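
As a usage sketch, the thread count of an OpenMP build can be capped with the standard `OMP_NUM_THREADS` variable; the model file and prompt below are placeholders:

```bash
# Run the OpenMP build with at most 4 threads (model and prompt are placeholders)
OMP_NUM_THREADS=4 ./run stories42M.bin -t 1.0 -s 256 -p "A big dog"
```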

**OpenBLAS**

This build enables acceleration via OpenBLAS.

```bash
make runopenblas
```

+ Requires [OpenBLAS](https://github.com/xianyi/OpenBLAS) to be installed on the system.
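
Installation is distribution specific; as one example (the package name is an assumption, check your distro), the packaged build is usually sufficient on Debian/Ubuntu:

```bash
# Debian/Ubuntu example; package names vary across distributions
sudo apt install libopenblas-dev
```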

**BLIS**

This build enables acceleration via BLIS.

```bash
make runblis
```

+ Requires [BLIS](https://github.com/flame/blis) compiled with `./configure --enable-cblas -t openmp,pthreads auto` to be installed on the system.
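
A from-source install along those lines might look like the sketch below; the install prefix is an assumption and should match `BLIS_PREFIX` in the Makefile (default `/usr/local`):

```bash
# Build BLIS with the CBLAS interface and threading enabled, then install it
# (the default /usr/local prefix is assumed; it must match BLIS_PREFIX in the Makefile)
git clone https://github.com/flame/blis
cd blis
./configure --enable-cblas -t openmp,pthreads auto
make -j
sudo make install
```

If BLIS lives elsewhere, override the prefix at build time, e.g. `make runblis BLIS_PREFIX=/opt/blis`.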


**Generic CBLAS**

This build enables acceleration with any library that exposes the Netlib CBLAS interface.

```bash
make runblas
```

+ Requires any BLAS library that provides the Netlib CBLAS interface, such as [LAPACK](https://www.netlib.org/lapack), to be installed on the system.
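
A quick way to verify that a CBLAS header and library are visible to the toolchain before building (a sketch; header and library locations are system dependent):

```bash
# Compile-and-link smoke test: succeeds only if cblas.h and a library
# providing -lcblas are found in the compiler's default search paths
printf '#include <cblas.h>\nint main(void){return 0;}\n' | gcc -xc - -lcblas -o /dev/null && echo "CBLAS found"
```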


**CLBlast (GPU/OpenCL)**

This build enables tuned GPU acceleration via OpenCL with CLBlast.

```bash
make runclblast
```

+ Requires [CLBlast](https://github.com/CNugteren/CLBlast) compiled with `cmake -DNETLIB=ON` to be installed on the system.

Note: This currently runs much slower than the CPU builds. It needs further investigation; memory transfer may be the bottleneck on the test system.
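
For reference, a from-source CLBlast build with the Netlib API enabled might look like this sketch (an OpenCL SDK/driver must already be present; prefix and generator defaults are assumptions):

```bash
# Build and install CLBlast with its Netlib CBLAS-compatible API enabled
git clone https://github.com/CNugteren/CLBlast
cd CLBlast
cmake -S . -B build -DNETLIB=ON
cmake --build build
sudo cmake --install build
```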


## Portable Binary Build

Have you ever wanted to inference a baby Llama 2 model with a single executable on any OS or *as* the OS? No? Well, now you can!
@@ -184,25 +251,22 @@ sudo ln -sf /opt/cosmo/tool/scripts/cosmoc++ /opt/cosmos/bin/cosmoc++

Example build to generate an Actually Portable Executable (APE):

```
$ cosmocc -O3 -Ofast -funsafe-math-optimizations -ffast-math -D COSMO_BLINK \
-D COSMO_METAL -D COSMO_ZIP -o run.com run.c -lm
Add model.bin and tokenizer.bin to executable:
$ zip run.com out/model.bin
$ zip run.com tokenizer.bin
```bash
make cosmorun
```

Run or copy to any supported system and run:

```
If model is embedded:

$ ./run.com
```bash
./run.com
```

Else, pass a model file:

$ ./run.com model.bin
```bash
./run.com model.bin
```
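
Since the APE is the same `run.c` built with cosmocc, it should accept the same sampling flags as the native builds (a sketch; the model name is a placeholder):

```bash
# Run the portable binary with an external model and explicit sampling options
./run.com stories42M.bin -t 1.0 -s 256 -p "A big dog"
```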

## contributing
15 changes: 9 additions & 6 deletions run.c
@@ -38,15 +38,18 @@ __static_yoink("zipos");
// ----------------------------------------------------------------------------
// BLAS Support

#ifdef CLBLAST
#include <clblast_netlib_c.h>
#if defined(CLBLAST) || defined(OPENBLAS) || defined(CBLAS) || defined(BLIS)
#define BLAS
#endif

#ifdef OPENBLAS
#ifdef CLBLAST
#include <clblast_netlib_c.h>
#elif defined(BLIS)
#include "blis.h"
#include "cblas.h"
#elif defined(OPENBLAS) || defined(CBLAS)
#include <cblas.h>
#define BLAS
#endif
#endif

// ----------------------------------------------------------------------------
// Standard Headers
@@ -622,7 +625,7 @@ int main(int argc, char *argv[]) {
int token = 1; // init with token 1 (=BOS), as done in Llama-2 sentencepiece tokenizer
int pos = 0; // position in the sequence
int bufferflush = 1; // token counter for flushing buffer
char outbuff[4096 * (6 + 2)] ; // buffersize is context length * average size of subwords + margin
static char outbuff[4096 * (6 + 2)] ; // buffersize is context length * average size of subwords + margin
printf("<s>\n"); // explicit print the initial BOS token for stylistic symmetry reasons

// setvbuf is used to buffer output into outbuff instead of flushing to screen directly
