Efficient way to merge multiple large DataFrames


Suppose I have 4 small DataFrames

df1, df2, df3 and df4

import pandas as pd
from functools import


                      


      
      
        
4 Answers

        
                         				            
            
           
            
                              
                
              
              
                
                  爱一瞬间的悲伤        
                
              
                            
                2020-12-10 19:21
              
            
            
                                                                       
This seems like part of what Dask DataFrames were designed to do (out-of-memory operations on DataFrames). See "Best way to join two large datasets in Pandas" for example code. I am not copying and pasting it here because I don't want to seem to be taking credit from the answerer in the linked entry.
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  离开以前        
                
              
                            
                2020-12-10 19:26
              
            
            
                                                                       
You can try a simple for loop. The only memory optimization I have applied is downcasting to the smallest suitable integer type via pd.to_numeric.

I also store the DataFrames in a dictionary, which is good practice when the number of objects is variable.

import pandas as pd

dfs = {}
dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])  
dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])   

df = dfs[1].copy()

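for i in range(2, max(dfs) + 1):
    # Rename the value column (label 2) so each frame contributes a distinct column.
    df = pd.merge(df, dfs[i].rename(columns={2: i + 1}),
                  on=[0, 1], how='outer').fillna(-1)
    # Assign by column label so the downcast dtype is kept; assigning through iloc may keep the old float dtype.
    value_cols = df.columns[2:]
    df[value_cols] = df[value_cols].apply(pd.to_numeric, downcast='integer')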

print(df)

   0  1   2   3   4   5
0  a  1  10  15  -1  -1
1  a  2  20  20  -1  -1
2  b  1   4  -1  -1  -1
3  c  1   2   2  -1  -1
4  e  2  10  -1  20  20
5  d  1  -1  -1  10  10
6  f  1  -1  -1   1  15


As a rule, you should not mix strings such as "missing" with numeric values, because doing so turns the entire series into an object-dtype series. Here we use -1, but you may prefer to use NaN with a float dtype instead.
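To make the dtype point concrete, here is a small sketch comparing the three options on a toy series. The values are made up for illustration; only the resulting dtypes matter.

```python
import pandas as pd

# A string placeholder forces the whole series to object dtype.
mixed = pd.Series([10, 20, "missing"])

# NaN keeps the series numeric, at the cost of using floats.
with_nan = pd.Series([10, 20, None], dtype="float64")

# A numeric sentinel such as -1 allows downcasting to a small integer type.
with_sentinel = pd.to_numeric(pd.Series([10, 20, -1]), downcast="integer")

for label, s in [("mixed", mixed), ("with_nan", with_nan), ("with_sentinel", with_sentinel)]:
    # deep=True also counts the Python objects held by an object-dtype series.
    print(label, s.dtype, s.memory_usage(index=False, deep=True))
```

The object series is the most expensive of the three, and numeric operations on it are slower as well.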
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  [愿得一人]        
                
              
                            
                2020-12-10 19:30
              
            
            
                                                                       
So, you have 48 DataFrames with 3 columns each: name, id, and a different value column for every DataFrame.

You don't have to use merge.

Instead, you can concat all the DataFrames:

df = pd.concat([df1,df2,df3,df4])


You will get:

Out[3]: 
   id name  pricepart1  pricepart2  pricepart3  pricepart4
0   1    a        10.0         NaN         NaN         NaN
1   2    a        20.0         NaN         NaN         NaN
2   1    b         4.0         NaN         NaN         NaN
3   1    c         2.0         NaN         NaN         NaN
4   2    e        10.0         NaN         NaN         NaN
0   1    a         NaN        15.0         NaN         NaN
1   2    a         NaN        20.0         NaN         NaN
2   1    c         NaN         2.0         NaN         NaN
0   1    d         NaN         NaN        10.0         NaN
1   2    e         NaN         NaN        20.0         NaN
2   1    f         NaN         NaN         1.0         NaN
0   1    d         NaN         NaN         NaN        10.0
1   2    e         NaN         NaN         NaN        20.0
2   1    f         NaN         NaN         NaN        15.0


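Now you can group by name and id and take the sum. Pass min_count=1 so that groups with no values in a column stay NaN instead of summing to 0; otherwise fillna has nothing to replace:

df.groupby(['name', 'id']).sum(min_count=1).fillna('missing').reset_index()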


If you try it with the 48 DataFrames, you will see that it avoids the MemoryError:

import numpy as np
import pandas as pd

dfList = []
# Create the 48 DataFrames of size 62245 x 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0, 100, size=(62245, 3)),
                               columns=['name', 'id', 'pricepart' + str(i + 1)]))

df = pd.concat(dfList)
# min_count=1 keeps NaN for groups with no values, so fillna has something to replace
df.groupby(['name', 'id']).sum(min_count=1).fillna('missing').reset_index()
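As a quick check of the concat-then-groupby idea on small data, here is a sketch using four frames reconstructed from the concatenated output shown above. The variable names stacked and wide are mine, and the frames are an assumption about what the original df1 to df4 looked like.

```python
import pandas as pd

# Frames reconstructed from the concatenated output shown above (an assumption).
df1 = pd.DataFrame({"name": list("aabce"), "id": [1, 2, 1, 1, 2], "pricepart1": [10, 20, 4, 2, 10]})
df2 = pd.DataFrame({"name": list("aac"), "id": [1, 2, 1], "pricepart2": [15, 20, 2]})
df3 = pd.DataFrame({"name": list("def"), "id": [1, 2, 1], "pricepart3": [10, 20, 1]})
df4 = pd.DataFrame({"name": list("def"), "id": [1, 2, 1], "pricepart4": [10, 20, 15]})

# Stack the frames vertically; columns a frame lacks are filled with NaN.
stacked = pd.concat([df1, df2, df3, df4], ignore_index=True)

# Collapse to one row per (name, id). min_count=1 keeps NaN where a key
# never appeared in a given frame, instead of reporting a sum of 0.
wide = stacked.groupby(["name", "id"]).sum(min_count=1).reset_index()
print(wide)
```

Note that this relies on each (name, id) pair appearing at most once per frame. If a key repeats within a frame, the sum adds the repeated values together, which a merge would not do.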

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  名媛妹妹        
                
              
                            
                2020-12-10 19:37
              
            
            
                                                                       
You may get some benefit from performing index-aligned concatenation using pd.concat. This should hopefully be faster and more memory-efficient than an outer merge.

df_list = [df1, df2, ...]
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)

df = pd.concat(df_list, axis=1)  # outer join by default; pass join='inner' to keep only keys present in every DataFrame
df.reset_index(inplace=True)


Alternatively, you can replace the concat (second step) by an iterative join:

from functools import reduce
# DataFrame.join defaults to how='left'; use how='outer' to keep every (name, id) key
df = reduce(lambda x, y: x.join(y, how='outer'), df_list)


This may or may not be better than the merge.
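Here is a self-contained sketch of both variants on small data, so the results can be compared. The four frames are assumptions modeled on the sample data in the other answers, and the names frames, indexed, wide and joined are mine. It uses a non-inplace set_index so the input frames are left untouched.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames modeled on the sample data in the other answers.
frames = [
    pd.DataFrame({"name": list("aabce"), "id": [1, 2, 1, 1, 2], "pricepart1": [10, 20, 4, 2, 10]}),
    pd.DataFrame({"name": list("aac"), "id": [1, 2, 1], "pricepart2": [15, 20, 2]}),
    pd.DataFrame({"name": list("def"), "id": [1, 2, 1], "pricepart3": [10, 20, 1]}),
    pd.DataFrame({"name": list("def"), "id": [1, 2, 1], "pricepart4": [10, 20, 15]}),
]

# Move the key columns into the index so the frames can be aligned on them.
indexed = [f.set_index(["name", "id"]) for f in frames]

# Variant 1: a single index-aligned concat (outer join on the index by default).
wide = pd.concat(indexed, axis=1).sort_index()

# Variant 2: an iterative outer join, one frame at a time.
joined = reduce(lambda left, right: left.join(right, how="outer"), indexed).sort_index()

print(wide.reset_index())
```

One caveat: as far as I know, the concat variant needs the (name, id) keys to be unique within each frame, since pandas cannot align on a duplicated index.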
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         