Numpy performance differences depending on numerical values

backend · open · 2 answers · 683 views

天命终不由人 asked 2021-01-02 13:05

I found a strange performance difference while evaluating an expression in Numpy.

I executed the following code:

import numpy as np
myarr = np.random.uniform(-1, 1, [1100, 1100])

# The same expression is far slower with the smaller divisor:
np.exp(-0.5 * (myarr / 0.001)**2)   # slow
np.exp(-0.5 * (myarr / 0.1)**2)     # fast

2 Answers
  • 2021-01-02 14:00

    Use Intel SVML

    I don't have a numexpr build with working Intel SVML at hand, but numexpr with working SVML should perform about as well as Numba. The Numba benchmarks below show much the same behaviour without SVML, but perform far better with SVML.
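
    Whether SVML was actually picked up can be checked before benchmarking (a minimal sketch; the `USING_SVML` config flag is assumed to be present in this Numba version, and `numba -s` prints the same information in its sysinfo report):

    import numba
    # True if Numba found Intel SVML at import time; setting the environment
    # variable NUMBA_DISABLE_INTEL_SVML=1 (as in the timings below) turns it off.
    print(numba.config.USING_SVML)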

    Code

    import numpy as np
    import numba as nb
    
    myarr = np.random.uniform(-1, 1, [1100, 1100])
    
    @nb.njit(error_model="numpy", parallel=True)
    def func(arr, div):
        # Elementwise exp(-0.5 * (arr/div)**2), compiled and parallelized;
        # note it uses the arr parameter, not the global myarr.
        return np.exp(-0.5 * (arr / div)**2)
    

    Timings

    #Core i7 4771
    #Windows 7 x64
    #Anaconda Python 3.5.5
    #Numba 0.41 (compilation overhead excluded)
    func(myarr,0.1)                      -> 3.6ms
    func(myarr,0.001)                    -> 3.8ms
    
    #Numba (set NUMBA_DISABLE_INTEL_SVML=1), parallel=True
    func(myarr,0.1)                      -> 5.19ms
    func(myarr,0.001)                    -> 12.0ms
    
    #Numba (set NUMBA_DISABLE_INTEL_SVML=1), parallel=False
    func(myarr,0.1)                      -> 16.7ms
    func(myarr,0.001)                    -> 63.2ms
    
    #Numpy (1.13.3), set OMP_NUM_THREADS=4
    np.exp( - 0.5 * (myarr / 0.001)**2 ) -> 70.82ms
    np.exp( - 0.5 * (myarr / 0.1)**2 )   -> 12.58ms
    
    #Numpy (1.13.3), set OMP_NUM_THREADS=1
    np.exp( - 0.5 * (myarr / 0.001)**2 ) -> 189.4ms
    np.exp( - 0.5 * (myarr / 0.1)**2 )   -> 17.4ms
    
    #Numexpr (2.6.8), no SVML, parallel
    ne.evaluate("exp( - 0.5 * (myarr / 0.001)**2 )") -> 17.2ms
    ne.evaluate("exp( - 0.5 * (myarr / 0.1)**2 )")   -> 4.38ms
    
    #Numexpr (2.6.8), no SVML, single threaded
    ne.evaluate("exp( - 0.5 * (myarr / 0.001)**2 )") -> 50.85ms
    ne.evaluate("exp( - 0.5 * (myarr / 0.1)**2 )")   -> 13.9ms
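
    The exact numbers are machine-dependent; a small harness along the following lines can reproduce the Numba measurements (a sketch, with a warm-up call so compilation overhead stays excluded):

    import timeit
    import numpy as np
    import numba as nb
    
    myarr = np.random.uniform(-1, 1, [1100, 1100])
    
    @nb.njit(error_model="numpy", parallel=True)
    def func(arr, div):
        return np.exp(-0.5 * (arr / div)**2)
    
    func(myarr, 0.1)  # warm-up call triggers JIT compilation before timing
    
    for div in (0.1, 0.001):
        # best of 5 repeats, 10 calls each, reported per call in ms
        t = min(timeit.repeat(lambda: func(myarr, div), number=10, repeat=5)) / 10
        print(f"func(myarr, {div}) -> {t * 1e3:.2f} ms")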
    
  • 2021-01-02 14:03

    With the small divisor, `-0.5 * (myarr / 0.001)**2` is hugely negative for most elements, so `exp()` underflows; some of the results land in the denormalized (subnormal) range, and operating on denormalized numbers slows down computations considerably.
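
    One way to confirm that denormals actually occur is to count results that fall strictly between zero and the smallest normal double (a minimal sketch):

    import numpy as np
    
    myarr = np.random.uniform(-1, 1, [1100, 1100])
    out = np.exp(-0.5 * (myarr / 0.001)**2)
    
    tiny = np.finfo(np.float64).tiny   # smallest *normal* positive double
    subnormal = np.count_nonzero((out > 0) & (out < tiny))
    print(f"{subnormal} of {out.size} results are denormalized")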

    You can disable denormalized numbers with the daz library:

    import daz
    daz.set_daz()   # enable Denormals-Are-Zero mode for this process
    

    More info: x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ):

    To avoid serialization and performance issues due to denormals and underflow numbers, use the SSE and SSE2 instructions to set Flush-to-Zero and Denormals-Are-Zero modes within the hardware to enable highest performance for floating-point applications.

    Note that in 64-bit mode floating point computations use SSE instructions, not x87.
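
    The effect is easy to measure by timing the slow expression before and after enabling these modes (a sketch; `daz.set_ftz()` is assumed to be available alongside `set_daz()` in the daz package):

    import timeit
    import numpy as np
    import daz
    
    myarr = np.random.uniform(-1, 1, [1100, 1100])
    expr = lambda: np.exp(-0.5 * (myarr / 0.001)**2)
    
    before = min(timeit.repeat(expr, number=5, repeat=3)) / 5
    daz.set_daz()   # treat denormal inputs as zero
    daz.set_ftz()   # flush denormal results to zero (assumed API)
    after = min(timeit.repeat(expr, number=5, repeat=3)) / 5
    print(f"before: {before * 1e3:.1f} ms, after: {after * 1e3:.1f} ms")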
