Improve Pandas Merge performance

北海茫月 2020-12-03 03:01

I don't specifically have a performance issue with pandas merge, as other posts suggest, but I have a class with a lot of methods that perform many merges on the dataset.

2 Answers
  •  遥遥无期
    2020-12-03 03:51

    Calling set_index on the merge column does indeed speed this up. Below is a slightly more realistic version of julien-marrec's answer.

    import pandas as pd
    import numpy as np

    # One million unique integer ids, drawn without replacement
    myids = np.random.choice(np.arange(10000000), size=1000000, replace=False)
    df1 = pd.DataFrame(myids, columns=['A'])
    df1['B'] = np.random.randint(0, 1000, 1000000)
    # The same ids in a different (shuffled) order
    df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
    df2['B2'] = np.random.randint(0, 1000, 1000000)

    %%timeit
    # Plain column-on-column merge
    x = df1.merge(df2, how='left', left_on='A', right_on='A2')
    #1 loop, best of 3: 664 ms per loop

    %%timeit
    # Indexes are rebuilt inside the timed statement on every run
    x = df1.set_index('A').join(df2.set_index('A2'), how='left')
    #1 loop, best of 3: 354 ms per loop

    %%time
    # Build the indexes once, up front
    df1.set_index('A', inplace=True)
    df2.set_index('A2', inplace=True)
    #Wall time: 16 ms

    %%timeit
    # Join on the prebuilt indexes
    x = df1.join(df2, how='left')
    #10 loops, best of 3: 80.4 ms per loop
    

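    Even when the integer join keys are in a different order in the two tables, joining on prebuilt indexes is still about 8 times faster than the column merge above (80.4 ms versus 664 ms). Building the indexes is a one-time cost of about 16 ms.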
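For code that runs many merges against the same tables, as the question describes, one option is to build each index once and reuse the indexed frames. The sketch below is a minimal illustration on small stand-in data, not a benchmark: the frame `df3`, its columns `A3` and `C`, and the sizes are assumptions made for the example. It also checks that the index join returns the same values as the column merge when the keys are unique.

```python
import numpy as np
import pandas as pd

# Small stand-in data (sizes are illustrative, not the benchmark sizes)
rng = np.random.default_rng(0)
ids = rng.choice(np.arange(10_000), size=1_000, replace=False)
df1 = pd.DataFrame({"A": ids, "B": rng.integers(0, 1000, size=1_000)})
df2 = pd.DataFrame({"A2": rng.permutation(ids),
                    "B2": rng.integers(0, 1000, size=1_000)})
# Hypothetical third table, only here to show the indexed frame being reused
df3 = pd.DataFrame({"A3": rng.permutation(ids),
                    "C": rng.integers(0, 1000, size=1_000)})

# Pay the indexing cost once per table
left = df1.set_index("A")
right = df2.set_index("A2")
other = df3.set_index("A3")

# Every later join reuses the prebuilt indexes
joined = left.join(right, how="left")
joined_other = left.join(other, how="left")

# Reference result from a plain column merge
merged = df1.merge(df2, how="left", left_on="A", right_on="A2")

# With unique keys, both left joins keep the row order of df1,
# so the value columns can be compared position by position
same_rows = (np.array_equal(joined["B"].to_numpy(), merged["B"].to_numpy())
             and np.array_equal(joined["B2"].to_numpy(), merged["B2"].to_numpy()))
print(same_rows)
```

Note that after `set_index` the key columns become the index, so `joined` carries only the value columns `B` and `B2`; call `reset_index()` if the key is needed back as a column.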
