Why is the following simple parallelized code much slower than a simple loop in Python?

Asked by 小鲜肉, 2020-12-19 09:24

A simple program which calculates the squares of numbers and stores the results:

    import time
    from joblib import Parallel, delayed
    import multiprocessing
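
The snippet above is cut off in this copy; a minimal sketch of the kind of program the question describes ( squaring numbers with joblib's Parallel / delayed and comparing it against a plain loop ) could look as follows — the function and variable names here are illustrative, not necessarily the original ones:

    import time
    from joblib import Parallel, delayed
    import multiprocessing

    inputs = range( 100000 )                         # the numbers to square

    def processInput( i ):                           # the per-item work: a single multiplication
        return i * i

    if __name__ == '__main__':
        num_cores = multiprocessing.cpu_count()

        start = time.time()                          # joblib: one delayed() job per input item
        results = Parallel( n_jobs = num_cores )( delayed( processInput )( i )
                                                  for i in inputs )
        print( 'parallel  %.3f [s]' % ( time.time() - start ) )

        start = time.time()                          # the "simple loop" it is compared against
        results = [ processInput( i ) for i in inputs ]
        print( 'serial    %.3f [s]' % ( time.time() - start ) )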


        
2 Answers
  •  攒了一身酷
    2020-12-19 10:04

    Why?
    Because it tries to use tools in cases
    where those tools principally cannot, and do not, repay their costs of entry.

    I love python.
    I pray that educators explain the costs of the tools better, otherwise we get lost in these wish-to-get [PARALLEL]-schedules.

    A few facts:

    No.0: With a lot of simplification, python intentionally uses the GIL to [SERIAL]-ise access to variables and thus avoid any potential collision from [CONCURRENT] modifications, paying the add-on costs of this GIL-stepped dancing in extra time.
    No.1: [PARALLEL]-code execution is way harder than "just"-[CONCURRENT] execution ( read more ).
    No.2: A [SERIAL]-process has to pay extra costs if it tries to split its work onto [CONCURRENT]-workers ( a minimal demonstration of these costs follows this list ).
    No.3: If a process does inter-worker communication, immense extra costs are paid per each data exchange.
    No.4: If the hardware has few resources for [CONCURRENT] processes, the results get still worse.
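
    To make the costs from No.2 and No.3 visible, here is a minimal sketch that uses only the standard-library multiprocessing module ( not the joblib code from the question; all names are illustrative ) — for work-packages this thin, the pool.map() dispatch usually comes out no faster, and frequently slower, than the plain loop:

    import time
    import multiprocessing as mp

    def small_job( i ):                                # a "thin" work-package: almost no work per call
        return i * i

    if __name__ == '__main__':
        data = range( 100000 )

        t0 = time.time()                               # plain [SERIAL] loop: no dispatch costs at all
        serial_results = [ small_job( i ) for i in data ]
        print( 'serial loop  %.4f [s]' % ( time.time() - t0 ) )

        pool = mp.Pool( processes = mp.cpu_count() )   # [PAR]-Setup-Overhead: spawning worker processes
        t0 = time.time()                               # every item gets pickled to a worker and back
        pool_results = pool.map( small_job, data )
        print( 'pool.map     %.4f [s]' % ( time.time() - t0 ) )
        pool.close(); pool.join()                      # [PAR]-Terminate-Overhead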


    To get a feel for what can be done in standard python 2.7.13:

    Efficiency lies in using the silicon better, not in bulldozing syntax-constructors into territories where they are legal, but where their performance has adverse effects on the end-to-end speed of the experiment-under-test:

    You pay about 8 ~ 11 [ms] just to iteratively assemble a 100000-item, zero-filled list array1:

    >>> from zmq import Stopwatch
    >>> aClk = Stopwatch()
    >>> aClk.start();array1 = [ 0 for i in xrange( 100000 ) ];aClk.stop()
     9751L
    10146L
    10625L
     9942L
    10346L
     9359L
    10473L
     9171L
     8328L
    

    ( the Stopwatch().stop() method yields the elapsed time in [us] since .start() ),
    while the memory-efficient, vectorisable, GIL-free numpy approach can do the same about +230x ~ +450x faster:

    >>> import numpy as np
    >>>
    >>> aClk.start();arrayNP = np.zeros( 100000 );aClk.stop()
       15L
       22L
       21L
       23L
       19L
       22L
    
    >>> aClk.start();arrayNP = np.zeros( 100000, dtype = np.int );aClk.stop()
       43L
       47L
       42L
       44L
       47L
    

    So, using the proper tools just starts the story of performance:

    >>> def test_SERIAL_python( nLOOPs = 100000 ):
    ...     aClk.start()
    ...     for i in xrange( nLOOPs ):           # py3 range() ~ xrange() in py27 
    ...         array1[i] = i**2                 # your loop-code
    ...     _ = aClk.stop()
    ...     return _
    

    While a naive [SERIAL]-iterative implementation works, you pay immense costs for opting to do so: ~ 70 [ms] for a 100000-D vector:

    >>> test_SERIAL_python( nLOOPs = 100000 )
     70318L
     69211L
     77825L
     70943L
     74834L
     73079L
    

    Using a more suitable / appropriate tool costs just ~ 0.2 [ms],
    i.e. ~ +350x FASTER:

    >>> aClk.start();arrayNP[:] = arrayNP[:]**2;aClk.stop()
    189L
    171L
    173L
    187L
    183L
    188L
    193L
    

    and with another trick, a.k.a. an in-place modus operandi:

    >>> aClk.start();arrayNP[:] *= arrayNP[:];aClk.stop()
    138L
    139L
    136L
    137L
    136L
    136L
    137L
    

    This yields ~ +514x SPEEDUP, just from using the appropriate tool.
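
    Applied back to the task from the question ( squaring a range of numbers and storing the results ), the same principle is a short, vectorised sketch — not a measured benchmark, just the idea:

    >>> import numpy as np
    >>>
    >>> inputs  = np.arange( 100000 )        # the numbers to square, held as one contiguous vector
    >>> results = inputs * inputs            # element-wise squares: GIL-free, zero per-item dispatch costs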

    The art of performance is not in following marketing-sounding claims
    about parallelising-( at-any-cost ),
    but in using know-how based methods that pay the least costs for the biggest achievable speedups.

    For "small" problems, the typical costs of distributing "thin" work-packages are indeed hard to get covered by any potentially achievable speedup, so the "problem-size" actually limits one's choice of methods that could reach a positive gain ( speedups of 0.9 or even << 1.0 are reported so often here on StackOverflow that you need not feel lost or alone in this sort of surprise ).


    Epilogue

    Processor number counts.
    Core number counts.
    But cache-sizes + NUMA-irregularities count more than that.
    Smart, vectorised, HPC-cured, GIL-free libraries matter
    ( numpy et al - thanks a lot Travis OLIPHANT & al ... Great Salute to his team ... )


    An overhead-strict (re-)formulation of Amdahl's Law explains why even many-N-CPU parallelised code execution may ( and indeed often does ) suffer from speedups << 1.0.

    The overhead-strict formulation of the Amdahl's Law speedup S explicitly includes the costs paid for the [PAR]-Setup and [PAR]-Terminate overheads:

                         1
    S =  __________________________
                     ( 1 - s )
          s  + pSO + _________ + pTO
                         N

          where:  s        := the [SERIAL]-fraction of the processing
                  ( 1 - s ):= the fraction of the processing that can run in [PARALLEL]
                  N        := the number of [CONCURRENT]-workers ( CPUs )
                  pSO      := [PAR]-Setup-Overhead     add-on costs
                  pTO      := [PAR]-Terminate-Overhead add-on costs

    ( an interactive animated tool for 2D visualisation of the effects of these performance constraints is cited here )
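
    For completeness, a minimal executable sketch of the overhead-strict speedup formula above ( the parameter values plugged in are illustrative, not measurements ):

    >>> def overhead_strict_speedup( s, N, pSO, pTO ):
    ...     # s   : the [SERIAL]-fraction of the processing
    ...     # N   : the number of [CONCURRENT]-workers ( CPUs )
    ...     # pSO : [PAR]-Setup-Overhead,     as a fraction of the pure-[SERIAL] run-time
    ...     # pTO : [PAR]-Terminate-Overhead, as a fraction of the pure-[SERIAL] run-time
    ...     return 1.0 / ( s + pSO + ( 1.0 - s ) / N + pTO )
    ...
    >>> overhead_strict_speedup( s = 0.05, N = 4, pSO = 0.0, pTO = 0.0 )   # zero overheads: classical Amdahl limit, S ~ 3.5
    >>> overhead_strict_speedup( s = 0.05, N = 4, pSO = 0.7, pTO = 0.1 )   # heavy overheads: S ~ 0.9 < 1.0, slower than pure-[SERIAL]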
