Efficient way to merge multiple large DataFrames

臣服心动 2020-12-10 18:51

Suppose I have 4 small DataFrames

df1, df2, df3 and df4

import pandas as pd
from functools import reduce
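
The four example DataFrames can be read back out of the concat output in the answer below, and the functools import suggests the merge that runs out of memory is a chained pd.merge via reduce. The following is a reconstruction of that setup, not the original code:

df1 = pd.DataFrame({'id': [1, 2, 1, 1, 2], 'name': ['a', 'a', 'b', 'c', 'e'],
                    'pricepart1': [10.0, 20.0, 4.0, 2.0, 10.0]})
df2 = pd.DataFrame({'id': [1, 2, 1], 'name': ['a', 'a', 'c'],
                    'pricepart2': [15.0, 20.0, 2.0]})
df3 = pd.DataFrame({'id': [1, 2, 1], 'name': ['d', 'e', 'f'],
                    'pricepart3': [10.0, 20.0, 1.0]})
df4 = pd.DataFrame({'id': [1, 2, 1], 'name': ['d', 'e', 'f'],
                    'pricepart4': [10.0, 20.0, 15.0]})

# Chained outer merges on the shared keys; with 48 large frames the
# intermediate results grow quickly and can raise MemoryError
merged = reduce(lambda left, right: pd.merge(left, right, on=['id', 'name'], how='outer'),
                [df1, df2, df3, df4])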


        
4 Answers
  •  愿得一人
    2020-12-10 19:30

    So you have 48 dfs, each with 3 columns: name, id, and a price column that is different for every df.

    You don't have to use merge at all.

    Instead, if you concat all the dfs:

    df = pd.concat([df1,df2,df3,df4])
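
    pd.concat keeps each frame's original index, which is why the row labels 0, 1, 2 repeat in the output below. Renumbering with ignore_index is optional (it does not change the groupby result), but it is available if the duplicated index bothers you:

    df = pd.concat([df1, df2, df3, df4], ignore_index=True)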
    

    You will get:

    Out[3]: 
       id name  pricepart1  pricepart2  pricepart3  pricepart4
    0   1    a        10.0         NaN         NaN         NaN
    1   2    a        20.0         NaN         NaN         NaN
    2   1    b         4.0         NaN         NaN         NaN
    3   1    c         2.0         NaN         NaN         NaN
    4   2    e        10.0         NaN         NaN         NaN
    0   1    a         NaN        15.0         NaN         NaN
    1   2    a         NaN        20.0         NaN         NaN
    2   1    c         NaN         2.0         NaN         NaN
    0   1    d         NaN         NaN        10.0         NaN
    1   2    e         NaN         NaN        20.0         NaN
    2   1    f         NaN         NaN         1.0         NaN
    0   1    d         NaN         NaN         NaN        10.0
    1   2    e         NaN         NaN         NaN        20.0
    2   1    f         NaN         NaN         NaN        15.0
    

    Now you can group by name and id and take the sum:

    df.groupby(['name','id']).sum().fillna('missing').reset_index()
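
    (Note: in recent pandas versions, groupby(...).sum() turns all-NaN groups into 0 by default, so the fillna('missing') above may find nothing left to fill. If you want the missing price parts to stay marked, one variation, not part of the original answer, is to pass min_count=1:)

    df.groupby(['name', 'id']).sum(min_count=1).fillna('missing').reset_index()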
    

    If you try this with the 48 dfs, you will see that it avoids the MemoryError:

    import numpy as np

    dfList = []
    # Create the 48 DataFrames of size 62245 x 3
    for i in range(48):
        dfList.append(pd.DataFrame(np.random.randint(0, 100, size=(62245, 3)),
                                   columns=['name', 'id', 'pricepart' + str(i + 1)]))
    
    df = pd.concat(dfList)
    df.groupby(['name','id']).sum().fillna('missing').reset_index()
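
    If memory is still tight even for the single concat, one further option (a general pandas trick, not part of the original answer) is to shrink the dtypes of the integer columns before concatenating. In this synthetic benchmark name and id are small random integers, so they downcast cleanly:

    # Downcast the integer columns of each frame before concatenating
    for d in dfList:
        for col in ['name', 'id']:
            d[col] = pd.to_numeric(d[col], downcast='integer')

    df = pd.concat(dfList)
    df.groupby(['name', 'id']).sum().fillna('missing').reset_index()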
    
