Pandas interpolate within a groupby

后端 未结 3 1851
误落风尘
误落风尘 2020-11-28 11:47

I\'ve got a dataframe with the following information:

    filename    val1    val2
t                   
1   file1.csv   5       10
2   file1.csv   NaN     Na         


        
相关标签:
3条回答
  • 2020-11-28 12:32
    >>> df.groupby('filename').apply(lambda group: group.interpolate(method='index'))
        filename  val1  val2
    t                       
    1  file1.csv     5    10
    2  file1.csv    10    15
    3  file1.csv    15    20
    6  file2.csv   NaN   NaN
    7  file2.csv    10    20
    8  file2.csv    12    15
    
    0 讨论(0)
  • 2020-11-28 12:51

    I ran into this as well. Instead of using apply, you can use transform, which will reduce your run time by more than 25% if you have on the order of 1000 groups:

    import numpy as np
    import pandas as pd
    
    np.random.seed(500)
    test_df = pd.DataFrame({
        'a': np.random.randint(low=0, high=1000, size=10000),
        'b': np.random.choice([1, 2, 4, 7, np.nan], size=10000, p=([0.2475]*4 + [0.01]))
    })
    

    Tests:

    %timeit test_df.groupby('a').transform(pd.DataFrame.interpolate)
    

    Output: 566 ms ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit test_df.groupby('a').apply(pd.DataFrame.interpolate)
    

    Output: 788 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit test_df.groupby('a').apply(lambda group: group.interpolate())
    

    Output: 787 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit test_df.interpolate()
    

    Output: 918 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    You will still see a significant increase in run-time compared to a fully vectorized call to interpolate on the full DataFrame, but I don't think you can do much better in pandas.

    0 讨论(0)
  • 2020-11-28 12:52

    Considering the long running time of above methods, I suggest use a for loop and interpolate(), which is no more than few lines of codes, but much faster in speed.

    for i in range(len(df.filename.unique())):
          mask = df.loc[:,'filename']==df.filename.unique()[i]
          df[mask]=dfs[mask].interpolate(method='index')
    
    0 讨论(0)
提交回复
热议问题