Why is the following simple parallelized code much slower than a simple loop in Python?


A simple program which calculates square of numbers and stores the results:

    import time
    from joblib import Parallel, delayed
    import multiprocessi


                      
2 answers

-上瘾入骨i (2020-12-19 09:56)
              
            
            
                                                                       
From the joblib documentation of the threading backend:


  If you know that the function you are calling is based on a compiled
  extension that releases the Python Global Interpreter Lock (GIL)
  during most of its computation ...


The problem is that in this case you don't know that. Python itself allows only one thread to run Python code at a time (the interpreter holds the GIL whenever it executes a Python operation).

threading is only going to be useful if myfun() spends most of its time in a compiled Python extension, and that extension releases the GIL.

The Parallel code is so embarrassingly slow because you are doing a huge amount of work to create multiple threads - and then you only execute one thread at a time anyway.

If you use the multiprocessing backend, you have to copy the input data into each of four or eight processes (one per core), do the processing in each process, and then copy the output data back. The copying is going to be slow, but if the processing is a little more complex than just calculating a square, it might be worth it. Measure and see.
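One way to "measure and see" is sketched here. The question's own code is truncated, so this is not a reconstruction of it: it swaps in the standard library's `concurrent.futures.ThreadPoolExecutor` for joblib's threading backend, and the names `square`, `timed`, `run_serial` and `run_threaded` are made up for illustration. The problem size of 20000 is also just an assumption to keep the run short.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Pure-Python work: the GIL is held for its whole (tiny) duration."""
    return x * x


def timed(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_serial(n):
    """Plain loop in a single thread."""
    return [square(i) for i in range(n)]


def run_threaded(n, workers=4):
    """One task per element, dispatched to a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, range(n)))


if __name__ == "__main__":
    n = 20_000  # hypothetical problem size
    serial, t_serial = timed(lambda: run_serial(n))
    threaded, t_threaded = timed(lambda: run_threaded(n))
    assert serial == threaded  # same answer, different cost
    print(f"serial:   {t_serial:.4f} s")
    print(f"threaded: {t_threaded:.4f} s")
```

Both variants return the same list; only the elapsed times differ. Because `square` never releases the GIL, the threaded run pays for task dispatch without gaining any concurrency.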
                                                                        
                                                        
            
            
              
                
攒了一身酷 (2020-12-19 10:04)
              
            
            
                                                                       
Why?
Because the tools are being used in a case
where they principally cannot, and DO NOT, cover their own costs of entry.

I love python.

I pray educators better explain the costs of tools, otherwise we get lost in these wish-to-get [PARALLEL]-schedules.

A few facts:

No.0: With a lot of simplification, python intentionally uses the GIL to [SERIAL]-ise access to variables and thus avoid any potential collision from [CONCURRENT] modifications, paying for this GIL-stepped dancing in extra time

No.1: [PARALLEL] code execution is way harder than "just"-[CONCURRENT] execution

No.2: A [SERIAL] process has to pay extra costs when it tries to split work among [CONCURRENT] workers
No.3: If a process does inter-worker communication, immense extra costs are paid per data exchange
No.4: If the hardware has few resources for [CONCURRENT] processes, results get even worse
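To get a rough feel for fact No.3, the sketch below compares the time of the useful work with the time needed just to serialise the input and the result. It assumes a process-based worker pool that moves data with `pickle`, as Python's `multiprocessing` does. The names `roundtrip` and `measure` are hypothetical, and the measurement leaves out pipe I/O and process start-up, so real transfer costs will be higher.

```python
import pickle
import time


def roundtrip(data):
    """Serialise and deserialise data, as a process-based pool has to."""
    return pickle.loads(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))


def measure(n=100_000):
    """Return the squares, both copies, and the two elapsed times in seconds."""
    data = list(range(n))

    t0 = time.perf_counter()
    squares = [x * x for x in data]      # the useful work
    t_work = time.perf_counter() - t0

    t0 = time.perf_counter()
    copy_in = roundtrip(data)            # input shipped to a worker
    copy_out = roundtrip(squares)        # result shipped back
    t_transfer = time.perf_counter() - t0

    return squares, copy_in, copy_out, t_work, t_transfer


if __name__ == "__main__":
    *_, t_work, t_transfer = measure()
    print(f"work:     {t_work * 1e3:.2f} ms")
    print(f"transfer: {t_transfer * 1e3:.2f} ms")
```

When the work per element is as thin as one multiplication, the transfer time is of the same order as the work itself, which leaves little room for any speedup.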



To get a feel for what can be done in standard python 2.7.13:

Efficiency comes from making better use of the silicon, not from bulldozing syntax-constructors into territories where they are legal but hurt the end-to-end speed of the experiment-under-test:

You pay about 8 ~ 11 [ms] just to iteratively assemble an empty array1:

>>> from zmq import Stopwatch
>>> aClk = Stopwatch()
>>> aClk.start();array1 = [ 0 for i in xrange( 100000 ) ];aClk.stop()
 9751L
10146L
10625L
 9942L
10346L
 9359L
10473L
 9171L
 8328L


( the Stopwatch().stop() method yields the [us] elapsed since .start() )

while the memory-efficient, vectorisable, GIL-free approach does the same about 230x ~ 450x faster:

>>> import numpy as np
>>>
>>> aClk.start();arrayNP = np.zeros( 100000 );aClk.stop()
   15L
   22L
   21L
   23L
   19L
   22L

>>> aClk.start();arrayNP = np.zeros( 100000, dtype = np.int );aClk.stop()
   43L
   47L
   42L
   44L
   47L
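The timings in this answer were taken with `zmq.Stopwatch` under python 2.7. For readers who want to repeat them on python 3 without pyzmq, a minimal stand-in with the same `start()` / `stop()` usage might look like the sketch below. `MicroStopwatch` is a hypothetical name, not part of any library; it relies only on the standard `time.perf_counter_ns()`.

```python
import time


class MicroStopwatch:
    """Hypothetical stand-in: stop() returns the [us] elapsed since start()."""

    def __init__(self):
        self._t0 = None

    def start(self):
        """Remember the current monotonic time in nanoseconds."""
        self._t0 = time.perf_counter_ns()

    def stop(self):
        """Return whole microseconds since start() and reset the watch."""
        if self._t0 is None:
            raise RuntimeError("stop() called before start()")
        elapsed_us = (time.perf_counter_ns() - self._t0) // 1000
        self._t0 = None
        return elapsed_us


if __name__ == "__main__":
    clk = MicroStopwatch()
    clk.start()
    array1 = [0 for _ in range(100_000)]  # python 3: range() replaces xrange()
    print(clk.stop(), "us")
```

Absolute numbers will differ from the ones quoted here, since they depend on the interpreter version and the hardware.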


So, using the proper tools just starts the story of performance:

>>> def test_SERIAL_python( nLOOPs = 100000 ):
...     aClk.start()
...     for i in xrange( nLOOPs ):           # py3 range() ~ xrange() in py27 
...         array1[i] = i**2                 # your loop-code
...     _ = aClk.stop()
...     return _


A naive [SERIAL]-iterative implementation works, but you pay dearly for opting to do so, ~ 70 [ms] for a 100000-element vector:

>>> test_SERIAL_python( nLOOPs = 100000 )
 70318L
 69211L
 77825L
 70943L
 74834L
 73079L


Using a more suitable tool costs just ~ 0.2 [ms],
 i.e. it is ~ 350x FASTER:

>>> aClk.start();arrayNP[:] = arrayNP[:]**2;aClk.stop()
189L
171L
173L
187L
183L
188L
193L


and with one more trick, an in-place modus operandi:

>>> aClk.start();arrayNP[:] *=arrayNP[:];aClk.stop()
138L
139L
136L
137L
136L
136L
137L


This yields a ~ 514x SPEEDUP, just from using an appropriate tool.
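A python 3 sketch of the same three variants, for readers who want to repeat the comparison. It is not the code that produced the numbers in this answer: it uses the standard `timeit` module instead of `zmq.Stopwatch`, it fills the array with `np.arange` so that all variants return the same squares, and the function names are made up for illustration. The two NumPy timings include the array creation.

```python
import timeit

import numpy as np

N = 100_000


def square_loop(n=N):
    """Element-by-element squaring in the interpreter."""
    out = [0] * n
    for i in range(n):
        out[i] = i ** 2
    return out


def square_vectorised(n=N):
    """One vectorised call; the loop runs inside NumPy's compiled code."""
    values = np.arange(n, dtype=np.int64)
    return values ** 2


def square_inplace(n=N):
    """In-place variant: no second array is allocated for the result."""
    values = np.arange(n, dtype=np.int64)
    values *= values
    return values


if __name__ == "__main__":
    for fn in (square_loop, square_vectorised, square_inplace):
        # Best of 5 repeats, 10 calls each, reported per call.
        best = min(timeit.repeat(fn, number=10, repeat=5)) / 10
        print(f"{fn.__name__:18s} {best * 1e6:10.1f} us per call")
```

The exact ratios depend on the machine and the library versions, so treat the 350x and 514x figures as measurements from one setup rather than constants.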

The art of performance is not in following marketing-sounding claims
about parallelising-( at-any-cost ),
but in using know-how based methods that pay the least costs for the biggest achievable speedups.

For "small"-problems, the typical costs of distributing "thin"-work-packages are indeed hard to cover with any achievable speedup, so the "problem-size" actually limits one's choice of methods that could reach a positive gain ( speedups of 0.9 or even << 1.0 are so often reported here, on StackOverflow, that you need not feel lost or alone in this sort of surprise ).



Epilogue

Processor number counts.

Core number counts.

But cache-sizes + NUMA-irregularities count more than that.

Smart, vectorised, HPC-cured, GIL-free libraries matter 
( numpy et al - thanks a lot Travis OLIPHANT & al ... Great Salute to his team ... )



An overhead-strict (re-)formulation of Amdahl's Law explains why even many-N-CPU parallelised code execution may ( and indeed often does ) suffer from speedups << 1.0

The overhead-strict formulation of the Amdahl's Law speedup S explicitly includes the costs paid for the [PAR]-Setup and [PAR]-Terminate overheads:

S = 1 / ( s + pSO + ( 1 - s ) / N + pTO )

where
    s         := the [SERIAL] fraction of the work
    ( 1 - s ) := the parallelisable fraction of the work
    N         := the number of CPUs working in [PARALLEL]
    pSO       := [PAR]-Setup-Overhead     add-on
    pTO       := [PAR]-Terminate-Overhead add-on


( an interactive animated tool for 2D visualising effects of these performance constraints is cited here )
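The overhead-strict formula is easy to play with in code. The sketch below assumes that s, pSO and pTO are all expressed as fractions of the original single-process runtime; the function name `overhead_strict_speedup` and the sample numbers are made up for illustration.

```python
def overhead_strict_speedup(s, n, pso=0.0, pto=0.0):
    """Speedup S = 1 / ( s + pSO + ( 1 - s ) / N + pTO ).

    s   : [SERIAL] fraction of the work, between 0 and 1
    n   : number of CPUs working in [PARALLEL], at least 1
    pso : [PAR]-Setup-Overhead, as a fraction of the serial runtime (assumed unit)
    pto : [PAR]-Terminate-Overhead, same unit
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    if pso < 0.0 or pto < 0.0:
        raise ValueError("overheads cannot be negative")
    return 1.0 / (s + pso + (1.0 - s) / n + pto)


if __name__ == "__main__":
    # Fully parallelisable work on 4 CPUs, no overheads: the classic ideal case.
    print(overhead_strict_speedup(0.0, 4))
    # Same work, but setup costs as much as the whole serial run and
    # termination half of it: the "speedup" drops below 1.0.
    print(overhead_strict_speedup(0.0, 4, pso=1.0, pto=0.5))
```

With zero overheads the function reduces to the classic Amdahl's Law. Once pSO + pTO exceed the time saved by splitting the parallelisable part, S falls below 1.0 no matter how large N gets.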
                                                                        
                                                        
            
            
              
                