Why is the following simple parallelized code much slower than a simple loop in Python?

Asked by 小鲜肉, 2020-12-19 09:24

A simple program which calculates the squares of numbers and stores the results:

    import time
    from joblib import Parallel, delayed
    import multiprocessing
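
The snippet above is cut off in this copy; a minimal sketch of the kind of program the question describes ( squaring numbers with joblib's Parallel / delayed and comparing it against a plain loop ) could look as follows — the function and variable names here are illustrative, not necessarily the original ones:

    import time
    from joblib import Parallel, delayed
    import multiprocessing

    inputs = range( 100000 )                         # the numbers to square

    def processInput( i ):                           # the per-item work: a single multiplication
        return i * i

    if __name__ == '__main__':
        num_cores = multiprocessing.cpu_count()

        start = time.time()                          # joblib: one delayed() job per input item
        results = Parallel( n_jobs = num_cores )( delayed( processInput )( i )
                                                  for i in inputs )
        print( 'parallel  %.3f [s]' % ( time.time() - start ) )

        start = time.time()                          # the "simple loop" it is compared against
        results = [ processInput( i ) for i in inputs ]
        print( 'serial    %.3f [s]' % ( time.time() - start ) )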


        
2 Answers
  •  攒了一身酷
    2020-12-19 10:04

    Why?
    Because it tries to use tools in cases
    where those tools principally cannot, and do not, repay their costs of entry.

    I love python.
    I pray that educators explain the costs of the tools better, otherwise we get lost in these wish-to-get [PARALLEL]-schedules.

    A few facts:

    No.0: With a lot of simplification, python intentionally uses the GIL to [SERIAL]-ise access to variables and thus avoid any potential collision from [CONCURRENT] modifications, paying the add-on costs of this GIL-stepped dancing in extra time.
    No.1: [PARALLEL]-code execution is way harder than "just"-[CONCURRENT] execution ( read more ).
    No.2: A [SERIAL]-process has to pay extra costs if it tries to split its work onto [CONCURRENT]-workers ( a minimal demonstration of these costs follows this list ).
    No.3: If a process does inter-worker communication, immense extra costs are paid per each data exchange.
    No.4: If the hardware has few resources for [CONCURRENT] processes, the results get still worse.
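
    To make the costs from No.2 and No.3 visible, here is a minimal sketch that uses only the standard-library multiprocessing module ( not the joblib code from the question; all names are illustrative ) — for work-packages this thin, the pool.map() dispatch usually comes out no faster, and frequently slower, than the plain loop:

    import time
    import multiprocessing as mp

    def small_job( i ):                                # a "thin" work-package: almost no work per call
        return i * i

    if __name__ == '__main__':
        data = range( 100000 )

        t0 = time.time()                               # plain [SERIAL] loop: no dispatch costs at all
        serial_results = [ small_job( i ) for i in data ]
        print( 'serial loop  %.4f [s]' % ( time.time() - t0 ) )

        pool = mp.Pool( processes = mp.cpu_count() )   # [PAR]-Setup-Overhead: spawning worker processes
        t0 = time.time()                               # every item gets pickled to a worker and back
        pool_results = pool.map( small_job, data )
        print( 'pool.map     %.4f [s]' % ( time.time() - t0 ) )
        pool.close(); pool.join()                      # [PAR]-Terminate-Overhead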


    To get a feel for what can be done in standard python 2.7.13:

    Efficiency lies in using the silicon better, not in bulldozing syntax-constructors into territories where they are legal, but where their performance has adverse effects on the end-to-end speed of the experiment-under-test:

    You pay about 8 ~ 11 [ms] just to iteratively assemble a 100000-item, zero-filled list array1:

    >>> from zmq import Stopwatch
    >>> aClk = Stopwatch()
    >>> aClk.start();array1 = [ 0 for i in xrange( 100000 ) ];aClk.stop()
     9751L
    10146L
    10625L
     9942L
    10346L
     9359L
    10473L
     9171L
     8328L
    

    ( the Stopwatch().stop() method yields the elapsed time in [us] since .start() ),
    while the memory-efficient, vectorisable, GIL-free numpy approach can do the same about +230x ~ +450x faster:

    >>> import numpy as np
    >>>
    >>> aClk.start();arrayNP = np.zeros( 100000 );aClk.stop()
       15L
       22L
       21L
       23L
       19L
       22L
    
    >>> aClk.start();arrayNP = np.zeros( 100000, dtype = np.int );aClk.stop()
       43L
       47L
       42L
       44L
       47L
    

    So, using the proper tools just starts the story of performance:

    >>> def test_SERIAL_python( nLOOPs = 100000 ):
    ...     aClk.start()
    ...     for i in xrange( nLOOPs ):           # py3 range() ~ xrange() in py27 
    ...         array1[i] = i**2                 # your loop-code
    ...     _ = aClk.stop()
    ...     return _
    

    While a naive [SERIAL]-iterative implementation works, you pay immense costs for opting to do so: ~ 70 [ms] for a 100000-D vector:

    >>> test_SERIAL_python( nLOOPs = 100000 )
     70318L
     69211L
     77825L
     70943L
     74834L
     73079L
    

    Using a more suitable / appropriate tool costs just ~ 0.2 [ms],
    i.e. ~ +350x FASTER:

    >>> aClk.start();arrayNP[:] = arrayNP[:]**2;aClk.stop()
    189L
    171L
    173L
    187L
    183L
    188L
    193L
    

    and with another trick, a.k.a. an in-place modus operandi:

    >>> aClk.start();arrayNP[:] *= arrayNP[:];aClk.stop()
    138L
    139L
    136L
    137L
    136L
    136L
    137L
    

    This yields ~ +514x SPEEDUP, just from using the appropriate tool.
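
    Applied back to the task from the question ( squaring a range of numbers and storing the results ), the same principle is a short, vectorised sketch — not a measured benchmark, just the idea:

    >>> import numpy as np
    >>>
    >>> inputs  = np.arange( 100000 )        # the numbers to square, held as one contiguous vector
    >>> results = inputs * inputs            # element-wise squares: GIL-free, zero per-item dispatch costs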

    The art of performance is not in following marketing-sounding claims
    about parallelising-( at-any-cost ),
    but in using know-how based methods that pay the least costs for the biggest achievable speedups.

    For "small" problems, the typical costs of distributing "thin" work-packages are indeed hard to get covered by any potentially achievable speedup, so the "problem-size" actually limits one's choice of methods that could reach a positive gain ( speedups of 0.9 or even << 1.0 are reported so often here on StackOverflow that you need not feel lost or alone in this sort of surprise ).


    Epilogue

    Processor number counts.
    Core number counts.
    But cache-sizes + NUMA-irregularities count more than that.
    Smart, vectorised, HPC-cured, GIL-free libraries matter
    ( numpy et al - thanks a lot Travis OLIPHANT & al ... Great Salute to his team ... )


    An overhead-strict (re-)formulation of Amdahl's Law explains why even many-N-CPU parallelised code execution may ( and indeed often does ) suffer from speedups << 1.0.

    The overhead-strict formulation of the Amdahl's Law speedup S explicitly includes the costs paid for the [PAR]-Setup and [PAR]-Terminate overheads:

                         1
    S =  __________________________
                     ( 1 - s )
          s  + pSO + _________ + pTO
                         N

          where:  s        := the [SERIAL]-fraction of the processing
                  ( 1 - s ):= the fraction of the processing that can run in [PARALLEL]
                  N        := the number of [CONCURRENT]-workers ( CPUs )
                  pSO      := [PAR]-Setup-Overhead     add-on costs
                  pTO      := [PAR]-Terminate-Overhead add-on costs

    ( an interactive animated tool for 2D visualisation of the effects of these performance constraints is cited here )
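
    For completeness, a minimal executable sketch of the overhead-strict speedup formula above ( the parameter values plugged in are illustrative, not measurements ):

    >>> def overhead_strict_speedup( s, N, pSO, pTO ):
    ...     # s   : the [SERIAL]-fraction of the processing
    ...     # N   : the number of [CONCURRENT]-workers ( CPUs )
    ...     # pSO : [PAR]-Setup-Overhead,     as a fraction of the pure-[SERIAL] run-time
    ...     # pTO : [PAR]-Terminate-Overhead, as a fraction of the pure-[SERIAL] run-time
    ...     return 1.0 / ( s + pSO + ( 1.0 - s ) / N + pTO )
    ...
    >>> overhead_strict_speedup( s = 0.05, N = 4, pSO = 0.0, pTO = 0.0 )   # zero overheads: classical Amdahl limit, S ~ 3.5
    >>> overhead_strict_speedup( s = 0.05, N = 4, pSO = 0.7, pTO = 0.1 )   # heavy overheads: S ~ 0.9 < 1.0, slower than pure-[SERIAL]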
