Skip to content

Commit

Permalink
Reduce CI noise (#150)
Browse files Browse the repository at this point in the history
* erand48() does not exist on windows (so much for <stdlib.h> ...)

* Deactivate fuzzing big sizes of matmul - pending #149

* Add setup and bench on GCC 8 as GCC10 OpenMP is broken

* size 2 also tends to stall?

* C++ compat shim

* For now skip the full GEMM "fuzz-like" tests.
  • Loading branch information
mratsim committed May 20, 2020
1 parent 17257c2 commit 33a446c
Show file tree
Hide file tree
Showing 4 changed files with 75 additions and 22 deletions.
2 changes: 1 addition & 1 deletion benchmarks/matmul_gemm_blas/test_gemm_output.nim
Original file line number Diff line number Diff line change
Expand Up @@ -79,7 +79,7 @@ proc testVsReference*(M, N, K: int) =
when isMainModule:
randomize(42) # For reproducibility

const sizes = [2,3,9,37,129,700]
const sizes = [2,3,9,37,129,700] # TODO: random syncScope stalls in CI https://github.com/mratsim/weave/issues/149

init(Weave)
for M in sizes:
Expand Down
65 changes: 53 additions & 12 deletions demos/raytracing/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,30 +27,71 @@ compared to the single-threaded or the single parallel-for versions.
Note: except for the nested parallelism which has RNG issue,
the Nim and C++ versions are pixel equivalent.

### Setup

CPU: i9-9980XE, 18 cores, overclocked at 4.1GHz all-core turbo (from 3.0 nominal)
The code was compiled with default flag, hence x86-64, hence SSE2.

- Nim devel (1.3.5 2020-05-16) + GCC v10.1.0
- `nim c --threads:off -d:danger`
- `nim c --threads:on -d:danger`
- GCC v10.1.0
- `-03`
- `-O3`
- `-O3 -fopenmp`
- GCC v8.4.0
- `-O3`
- `-O3 -fopenmp`
- Clang v10.0.0
- `-03`
- `-O3`
- `-O3 -fopenmp`

| Bench | Nim | Clang C++ | GCC C++ |
|------------------|------------:|----------:|------------:|
| Single-threaded | 4min43.369s | 4m51.052s | 4min50.934s |
| Multithreaded | 13.211s | 14.428s | 2min14.616s |
| Nested-parallel | 12.981s | | |
| Parallel speedup | 21.83x | 20.17x | 2.16x |
### Commands


```bash
git clone https://github.com/mratsim/weave
cd weave
nimble install -y # install Weave dependencies, here synthesis, overwriting if asked.

nim -v # Ensure you have nim 1.2.0 or more recent

# Threads on (by default in this repo)
nim c -d:danger -o:build/ray_threaded demos/raytracing/smallpt.nim

# Threads off
nim c -d:danger --threads:off -o:build/ray_single demos/raytracing/smallpt.nim

g++ -O3 -o build/ray_gcc_single demos/raytracing/smallpt.cpp
g++ -O3 -fopenmp -o build/ray_gcc_omp demos/raytracing/smallpt.cpp

clang++ -O3 -o build/ray_clang_single demos/raytracing/smallpt.cpp
clang++ -O3 -fopenmp -o build/ray_clang_single demos/raytracing/smallpt.cpp
```

### Results & Analysis

GCC 10 has a significant OpenMP regression

| Bench | Nim | Clang C++ OpenMP | GCC 10 C++ OpenMP | GCC 8 C++ OpenMP |
| ---------------- | ----------: | ---------------: | ----------------: | ---------------: |
| Single-threaded | 4min43.369s | 4m51.052s | 4min50.934s | 4m50.648s |
| Multithreaded | 12.977s | 14.428s | 2min14.616s | 12.244s |
| Nested-parallel | 12.981s | | | |
| Parallel speedup | 21.83x | 20.17x | 2.16x | 23.74x |

Single-threaded Nim is 2.7% faster than Clang C++.
Multithreaded Nim via Weave is 11.1% faster Clang C++.

GCC 8 despite a simpler OpenMP design (usage of a global task queue instead of work-stealing)
achieves a better speedup than both Weave and Clang.
In that case, I expect it's because the tasks are so big that there is minimal contention
on the task queue, furthermore the OpenMP schedule is "Dynamic" so we avoid the worst case scenario
with static scheduling where a bunch of threads are assigned easy rays that never collide with a surface
and a couple of threads are drowned in complex rays.

Single-threaded Nim is 2.7% faster than Clang C++
Multithreaded Nim via Weave is 11.1% faster Clang C++
I have absolutely no idea of what happened to OpenMP in GCC 10.

Note: I only have 18 cores but we observe over 18x speedup
Note: I only have 18 cores but we observe speedups in the 20x
with Weave and LLVM. This is probably due to 2 factors:
- Raytracing is pure compute, in particular contrary to high-performance computing
and machine learning workloads which are also very memory-intensive (matrices and tensors with thousands to millions of elements)
Expand All @@ -65,7 +106,7 @@ with Weave and LLVM. This is probably due to 2 factors:

## License

Kevin beason code is licensed under (mail redacted to avoid spam)
Kevin Beason code is licensed under (mail redacted to avoid spam)

```
LICENSE
Expand Down
10 changes: 9 additions & 1 deletion demos/raytracing/smallpt.nim
Original file line number Diff line number Diff line change
Expand Up @@ -101,7 +101,15 @@ func intersect(r: Ray, t: var float64, id: var int32): bool =
id = i.int32
return t < inf

proc erand48(xi: var array[3, cushort]): cdouble {.importc, header:"<stdlib.h>", sideeffect.} # Need the same RNG for comparison
when defined(cpp):
# Seems like Nim codegen for mutable arrays is slightly different from the C++ API
# and needs a compatibility shim
proc erand48(xi: ptr cushort): cdouble {.importc, header:"<stdlib.h>", sideeffect.}
proc erand48(xi: var array[3, cushort]): float64 {.inline.} =
erand48(xi[0].addr)
else:
# Need the same RNG for comparison
proc erand48(xi: var array[3, cushort]): cdouble {.importc, header:"<stdlib.h>", sideeffect.}

proc radiance(r: Ray, depth: int32, xi: var array[3, cushort]): Vec =
var t: float64 # distance to intersection
Expand Down
20 changes: 12 additions & 8 deletions weave.nimble
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,8 @@ task test, "Run Weave tests":

test "-d:WV_LazyFlowvar", "tests/test_background_jobs.nim"

test "", "demos/raytracing/smallpt.nim"
when not defined(windows): # Does not support erand48
test "", "demos/raytracing/smallpt.nim"

test "", "benchmarks/dfs/weave_dfs.nim"
test "", "benchmarks/fibonacci/weave_fib.nim"
Expand Down Expand Up @@ -90,9 +91,10 @@ task test, "Run Weave tests":
# - spawn
# - spawnDelayed by pledges
# - syncScope
when not defined(windows) and (defined(i386) or defined(amd64)):
if not existsEnv"TEST_LANG" or getEnv"TEST_LANG" != "cpp":
test "-d:danger", "benchmarks/matmul_gemm_blas/test_gemm_output.nim"
when false: # TODO, not sure why this stalls why the gemm_weave_nestable don't - https://github.com/mratsim/weave/pull/150
when not defined(windows) and (defined(i386) or defined(amd64)):
if not existsEnv"TEST_LANG" or getEnv"TEST_LANG" != "cpp":
test "-d:danger", "benchmarks/matmul_gemm_blas/test_gemm_output.nim"

task test_gc_arc, "Run Weave tests with --gc:arc":
test "--gc:arc", "weave/cross_thread_com/channels_spsc_single.nim"
Expand Down Expand Up @@ -123,7 +125,8 @@ task test_gc_arc, "Run Weave tests with --gc:arc":

test "--gc:arc -d:WV_LazyFlowvar", "tests/test_background_jobs.nim"

test "--gc:arc", "demos/raytracing/smallpt.nim"
when not defined(windows):
test "--gc:arc", "demos/raytracing/smallpt.nim"

test "--gc:arc", "benchmarks/dfs/weave_dfs.nim"
test "--gc:arc", "benchmarks/fibonacci/weave_fib.nim"
Expand Down Expand Up @@ -158,9 +161,10 @@ task test_gc_arc, "Run Weave tests with --gc:arc":
# - spawn
# - spawnDelayed by pledges
# - syncScope
when not defined(windows) and (defined(i386) or defined(amd64)):
if not existsEnv"TEST_LANG" or getEnv"TEST_LANG" != "cpp":
test "-d:danger", "benchmarks/matmul_gemm_blas/test_gemm_output.nim"
when false: # TODO, not sure why this stalls why the gemm_weave_nestable don't - https://github.com/mratsim/weave/pull/150
when not defined(windows) and (defined(i386) or defined(amd64)):
if not existsEnv"TEST_LANG" or getEnv"TEST_LANG" != "cpp":
test "-d:danger", "benchmarks/matmul_gemm_blas/test_gemm_output.nim"

task gen_book, "Generate Weave documentation":
exec "mdbook build docs"
Expand Down

0 comments on commit 33a446c

Please sign in to comment.