DaCe slower than Numba #1540

Open · saleem-akhtar opened this issue Feb 29, 2024 · 0 comments
Describe the bug
I see different performance relative to Numba than the DaCe benchmarking notebook suggests.

To Reproduce
https://nbviewer.org/github/spcl/dace/blob/master/tutorials/benchmarking.ipynb
In the above notebook, it is suggested that the following code should run faster with DaCe than with Numba:

import numba
from numba import njit
import numpy as np
import dace

@njit
def element_update_numba(a):
    return a * 5

def element_update_dace(a):  # called from inside the dace.program versions
    return a * 5

def element_update(a):       # plain-Python baseline
    return a * 5


def someforloop(A):
    for i in range(1000):
        for j in range(1000):
            A[i, j] = element_update(A[i, j])

@njit(parallel=True)
def someforloop_numba_parallel(A):
    for i in numba.prange(1000):
        for j in numba.prange(1000):
            A[i, j] = element_update_numba(A[i, j])

@njit
def someforloop_numba(A):
    for i in range(1000):
        for j in range(1000):
            A[i, j] = element_update_numba(A[i, j])


@dace.program(auto_optimize=True, device=dace.DeviceType.CPU)
def someforloop_dace_parallel(A: dace.float64[1000, 1000]):
    for i in dace.map[0: 1000]:
        for j in dace.map[0: 1000]:
            A[i, j] = element_update_dace(A[i, j])

@dace.program(auto_optimize=True, device=dace.DeviceType.CPU)
def someforloop_dace(A: dace.float64[1000, 1000]):
    for i in range(1000):
        for j in range(1000):
            A[i, j] = element_update_dace(A[i, j])

someforloop_dace_parallel_compiled = someforloop_dace_parallel.compile()
someforloop_dace_compiled = someforloop_dace.compile()

a_orig = np.random.rand(1000, 1000)
TIMES = {}
a = a_orig.copy()
TIMES['numpy'] = %timeit -o someforloop(a)
a = a_orig.copy()
TIMES['numba'] = %timeit -o someforloop_numba(a)
a = a_orig.copy()
TIMES['numba_parallel'] = %timeit -o someforloop_numba_parallel(a)
a = a_orig.copy()
TIMES['dace_njit'] = %timeit -o someforloop_dace(a)
a = a_orig.copy()
TIMES['dace_compiled'] = %timeit -o someforloop_dace_compiled(a)
a = a_orig.copy()
TIMES['dace_parallel_njit'] = %timeit -o someforloop_dace_parallel(a)
a = a_orig.copy()
TIMES['dace_parallel_compiled'] = %timeit -o someforloop_dace_parallel_compiled(a)
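
For reference, outside Jupyter the %timeit magics are unavailable; the timing lines can be swapped for plain timeit. A minimal sketch for one variant, assuming the functions and a_orig defined above:

import timeit

a = a_orig.copy()
# Roughly mirrors %timeit: best of 5 repetitions of 100 calls each.
runs = timeit.repeat(lambda: someforloop_dace_parallel_compiled(a),
                     repeat=5, number=100)
print(f"dace_parallel_compiled: {min(runs) / 100 * 1e6:.1f} µs per call")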

However, I get the following results:

numpy: 285 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numba: 174 µs ± 4.64 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
numba_parallel: 63.9 µs ± 5.91 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
dace_njit: 725 µs ± 39 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
dace_compiled: 161 µs ± 3.59 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
dace_parallel_njit: 679 µs ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
dace_parallel_compiled: 90.1 µs ± 4.89 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Am I missing any optimisations, or is something missing from my laptop's installation, that would make this code run as fast as the notebook suggests?
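
One thing I considered (my own assumption, not something from the notebook) is the set of compiler flags DaCe passes when building the generated CPU code; they can be inspected and overridden through dace.Config. A sketch, using GCC/Clang-style flag values that are my guess (on Windows with MSVC the flag syntax differs, e.g. /O2):

import dace

# Inspect the flags DaCe currently passes to the CPU compiler.
print(dace.Config.get('compiler', 'cpu', 'args'))

# Assumed override: request aggressive optimization, then recompile.
dace.Config.set('compiler', 'cpu', 'args', value='-O3 -march=native')
someforloop_dace_parallel_compiled = someforloop_dace_parallel.compile()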

Expected behavior
DaCe to be faster than Numba, as the benchmarking notebook suggests.

Desktop (please complete the following information):

  • OS: Windows 10
  • IDE: VSCode
  • Version: latest