mkl_umath does not bring performance benefits relative to vanilla numpy #1

@samaid

https://gist.github.com/samaid/bb680421ee29926cc7b8e536ee9a931c

The test was run on a TGL (Tiger Lake) node in Intel DevCloud in two setups:

  1. STOCK: Clean environment with numpy installed from -c conda-forge
  2. INTEL: Clean environment with numpy installed from -c intel
(intel) u184071@s019-n016:~/repos/dpnp-umath$ python test.py
NP: [0.18639128 0.10316299 0.25168699 ... 0.11474663 0.59490342 0.68693815]
Buffer size: 8192
0.3318898677825928
NP: [0.18639128 0.10316299 0.25168699 ... 0.11474663 0.59490342 0.68693815]
Buffer size: 1600000
0.3113992214202881
UM: [0.18639128 0.10316299 0.25168699 ... 0.11474663 0.59490342 0.68693815]
0.30924224853515625 

(condaforge) u184071@s019-n016:~/repos/dpnp-umath$ python test.py
NP: [0.71962608 0.53769131 0.39456384 ... 0.20209085 0.19296594 0.17458681]
Buffer size: 8192
0.3226659297943115
NP: [0.71962608 0.53769131 0.39456384 ... 0.20209085 0.19296594 0.17458681]
Buffer size: 1600000
0.32870054244995117
No mkl_umath found. Skipping test...

No NumPy performance difference between the stock and Intel builds is observed at the default buffer size (8192), and the Intel build is only marginally faster when the buffer size is raised to 1600000 (16*10^5) with numpy.setbufsize().
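For reference, the kind of comparison described here can be sketched roughly as follows. This is not the script from the gist; it is a minimal illustration that assumes the benchmarked function is sin on a random float64 array, and that mkl_umath exposes a ufunc named sin at package level. The helper names (time_call, run_benchmark), the array size, and the repeat count are all made up for the example.

```python
import time

import numpy as np


def time_call(func, x, repeats=5):
    """Return the best wall-clock time of func(x) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(x)
        best = min(best, time.perf_counter() - start)
    return best


def run_benchmark(n=10_000_000, large_bufsize=1_600_000):
    """Time sin with the default ufunc buffer, a large buffer, and mkl_umath."""
    x = np.random.default_rng(0).random(n)
    results = {}

    default_bufsize = np.getbufsize()
    results["numpy, bufsize=%d" % default_bufsize] = time_call(np.sin, x)

    # numpy.setbufsize() changes the ufunc buffer size for later calls.
    np.setbufsize(large_bufsize)
    try:
        results["numpy, bufsize=%d" % large_bufsize] = time_call(np.sin, x)
        try:
            import mkl_umath  # optional; only present in the Intel setup
        except ImportError:
            print("No mkl_umath found. Skipping test...")
        else:
            # Assumption: mkl_umath.sin is available as a drop-in ufunc.
            results["mkl_umath"] = time_call(mkl_umath.sin, x)
    finally:
        # Restore the original buffer size so later code is unaffected.
        np.setbufsize(default_bufsize)
    return results


if __name__ == "__main__":
    for label, seconds in run_benchmark().items():
        print(label, seconds)
```

Taking the best of several repeats reduces noise from a shared node, which matters when the differences being compared are only a few percent.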

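This behavior is not observed on an SPR (Sapphire Rapids) node in Intel DevCloud, where the larger buffer makes the Intel build roughly 10x faster (0.43 vs. 0.04) while the stock build is unchanged: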

(intel) u184071@s018-n003:~/repos/dpnp-umath$ python test.py
NP: [0.71095155 0.23050819 0.1467021  ... 0.26945045 0.18541328 0.83865669]
Buffer size: 8192
0.4312753677368164
NP: [0.71095155 0.23050819 0.1467021  ... 0.26945045 0.18541328 0.83865669]
Buffer size: 1600000
0.04172515869140625
UM: [0.71095155 0.23050819 0.1467021  ... 0.26945045 0.18541328 0.83865669]
0.03204202651977539

(condaforge) u184071@s018-n003:~/repos/dpnp-umath$ python test.py
NP: [0.74352341 0.67897181 0.80952154 ... 0.02458932 0.78159    0.10357044]
Buffer size: 8192
0.34731459617614746
NP: [0.74352341 0.67897181 0.80952154 ... 0.02458932 0.78159    0.10357044]
Buffer size: 1600000
0.3502378463745117
No mkl_umath found. Skipping test..

Two things stand out. First, it looks like multithreading is not exercised on the TGL system. Second, the default buffer size is too small to get any benefit from multithreading. According to the chart linked below, multithreading is beneficial for buffer sizes greater than 10K, and the performance is materially different for sizes of 100K-1M:
https://www.intel.com/content/www/us/en/develop/documentation/onemkl-vmperfdata/top/real-functions/trigonometric/sin.html
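One way to start checking the multithreading suspicion on the TGL node is to look at what the process is allowed to use. The sketch below is only a diagnostic aid under assumptions: it reads the usual MKL/OpenMP environment variables and, if the optional mkl-service package (imported as mkl) happens to be installed, asks it for the maximum thread count. The function name threading_report is made up for the example.

```python
import os


def threading_report():
    """Collect basic facts that influence MKL multithreading."""
    report = {"cpu_count": os.cpu_count()}

    # CPUs this process may actually run on (Linux only).
    if hasattr(os, "sched_getaffinity"):
        report["affinity_cpus"] = len(os.sched_getaffinity(0))

    # Environment variables that commonly cap the thread count.
    for var in ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "MKL_DYNAMIC"):
        report[var] = os.environ.get(var)

    try:
        import mkl  # mkl-service; assumed optional here
    except ImportError:
        report["mkl_max_threads"] = None
    else:
        report["mkl_max_threads"] = mkl.get_max_threads()
    return report


if __name__ == "__main__":
    for key, value in threading_report().items():
        print(key, value)
```

If the reported thread count or CPU affinity is 1 on the TGL node but larger on the SPR node, that would be consistent with the timings above, though it would not by itself prove the cause.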
