Pandas' equivalent of resample for integer index

花落未央 2020-12-15 08:41

I'm looking for a pandas equivalent of the resample method for a dataframe whose index isn't a DatetimeIndex but an array of integers, or maybe even fl

3 Answers
  • 2020-12-15 09:23

    Setup

    import pandas as pd
    import numpy as np
    
    np.random.seed([3,1415])
    df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])
    

    You need to create the labels to group by yourself. I'd use:

    (df.index.to_series() / 5).astype(int)
    

    This gives a series of values like [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...], which you can then use in a groupby.
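
    To make the labelling step concrete, here is a minimal sketch. The small frame and the negative-valued series are made up for illustration. It uses floor division, which matches divide-then-astype(int) for a non-negative index but differs once the index goes negative:

```python
import numpy as np
import pandas as pd

# Illustrative frame with a default 0..19 integer index
df = pd.DataFrame(np.arange(40).reshape(20, 2), columns=["A", "B"])

# Floor division: five consecutive rows share one label
labels = df.index.to_series() // 5
print(labels.tolist())  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, ...]

# With negative values the two approaches disagree:
# astype(int) truncates toward zero, // rounds down.
neg = pd.Series([-7, -3, 2, 6])
truncated = (neg / 5).astype(int).tolist()  # [-1, 0, 0, 1]
floored = (neg // 5).tolist()               # [-2, -1, 0, 1]
```

    With truncation, -3 and 2 end up in the same group even though they sit on opposite sides of zero, so floor division is probably the safer choice when the index can be negative.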

    You'll also need to specify the index for the new dataframe. I'd use:

    df.index[4::5]
    

    This takes the current index starting at the 5th position (hence the 4) and every 5th position after that, giving [4, 9, 14, 19]. I could have used df.index[::5] to get the starting positions instead, but I went with the ending positions.

    Solution

    # assign as variable because I'm going to use it more than once.
    s = (df.index.to_series() / 5).astype(int)
    
    df.groupby(s).std().set_index(s.index[4::5])
    

    Looks like:

               A         B
    4   0.198019  0.320451
    9   0.329750  0.408232
    14  0.293297  0.223991
    19  0.095633  0.376390
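
    If the number of rows is not a multiple of 5, df.index[4::5] comes up one label short and set_index raises a length mismatch. A minimal sketch of one way around that, assuming the last index value in each group is the label you want. The 22-row frame is made up, and mean is used instead of std only so the numbers are easy to check by eye:

```python
import numpy as np
import pandas as pd

# 22 rows, so the last group holds only 2 rows
df = pd.DataFrame({"A": np.arange(22, dtype=float)})

pos = df.index.to_series()
s = pos // 5

# Largest index value inside each group: 4, 9, 14, 19, 21
end_labels = pos.groupby(s).max()

df_down = df.groupby(s).mean()
# Both groupby results are sorted by group label, so the order lines up
df_down.index = end_labels.to_numpy()
print(df_down)
```

    Swapping max() for min() would label each group by its starting position instead.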
    

    Other considerations

    This is the equivalent of downsampling. We haven't addressed upsampling yet.

    To go back from what we've produced to a dataframe indexed by something more frequent, we can use reindex like so:

    # assign what we've done above to df_down
    df_down = df.groupby(s).std().set_index(s.index[4::5])
    
    df_up = df_down.reindex(range(20)).bfill()
    

    Looks like:

               A         B
    0   0.198019  0.320451
    1   0.198019  0.320451
    2   0.198019  0.320451
    3   0.198019  0.320451
    4   0.198019  0.320451
    5   0.329750  0.408232
    6   0.329750  0.408232
    7   0.329750  0.408232
    8   0.329750  0.408232
    9   0.329750  0.408232
    10  0.293297  0.223991
    11  0.293297  0.223991
    12  0.293297  0.223991
    13  0.293297  0.223991
    14  0.293297  0.223991
    15  0.095633  0.376390
    16  0.095633  0.376390
    17  0.095633  0.376390
    18  0.095633  0.376390
    19  0.095633  0.376390
    

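    We could also reindex by something else, such as range(0, 20, 2), to upsample to the even integer indices. Note that reindex keeps only the labels present in the new index, so the rows labelled 9 and 19 would be dropped before the fill.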
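
    A minimal sketch of the even-index case, assuming a stand-in df_down with made-up values 1.0 to 4.0 in place of the real standard deviations. Reindexing straight onto range(0, 20, 2) discards the odd labels 9 and 19; reindexing onto the union of both indices first, filling, and then selecting the target labels is one way to keep their values:

```python
import pandas as pd

# Stand-in for the downsampled frame (values are illustrative)
df_down = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=[4, 9, 14, 19])
target = pd.RangeIndex(0, 20, 2)

# Direct reindex: labels 9 and 19 are not in the target, so their rows vanish
direct = df_down.reindex(target).bfill()

# Reindex onto the union, back-fill, then keep only the target labels
via_union = df_down.reindex(df_down.index.union(target)).bfill().loc[target]

print(direct["A"].tolist())     # [1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0, 3.0, nan, nan]
print(via_union["A"].tolist())  # [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0]
```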

  • 2020-12-15 09:30

    @piSquared's solution is really nice, but I don't like picking the index by hand when reindexing.

    This should also work for any kind of downsampling (including a float index), and it automatically uses the mean of the index values in each range as the new label:

    df = pd.DataFrame(index = np.random.rand(20)*30, data=np.random.rand(20, 2), columns=['A', 'B'])
    df.index.name = 'crazy_index'
    
    s = (df.index.to_series() / 10).astype(int)
    

    Now you can pick whichever function you want to apply to each subgroup:

    # calculate std() in each group
    df.groupby(s).std().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )
    
                        A         B
    crazy_index
    3.667539     0.276986  0.317642
    14.275074    0.248700  0.372551
    25.054042    0.254860  0.297586
    
    # calculate median() in each group
    df.groupby(s).median().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )
                        A         B
    crazy_index
    3.667539     0.454654  0.521649
    14.275074    0.451265  0.490125
    25.054042    0.489326  0.622781
    

    EDIT: There were some errors in how s was indexed; it is now correct and working.
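
    The same idea can be wrapped in a small function. This is only a sketch: downsample, width and func are hypothetical names, the sample data is made up, and np.floor is used in place of astype(int) on the raw quotient so that negative index values land in the bin below rather than being truncated toward zero:

```python
import numpy as np
import pandas as pd

def downsample(df, width, func="mean"):
    """Group rows into bins of `width` along a numeric index (hypothetical helper)."""
    pos = df.index.to_numpy()
    # Integer bin id per row; floor keeps bins consistent below zero
    bin_id = np.floor(pos / width).astype(int)
    out = df.groupby(bin_id).agg(func)
    # Label each bin with the mean of the index values it actually contains
    centers = pd.Series(pos).groupby(bin_id).mean()
    out.index = pd.Index(centers.to_numpy(), name=df.index.name)
    return out

df = pd.DataFrame(
    {"A": [1.0, 3.0, 10.0, 20.0, 30.0]},
    index=pd.Index([0.5, 2.5, 11.0, 12.0, 25.0], name="crazy_index"),
)
result = downsample(df, width=10)
print(result)
```

    Here the three bins get the labels 1.5, 11.5 and 25.0, and passing func="std" or func="median" switches the aggregation.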

  • 2020-12-15 09:31

    Alternatively, here is one thing that can be done:

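    import pandas as pd

    def resample(df, rule, how=None, **kwargs):
        # default aggregation is the mean
        if how is None:
            how = "mean"

        if isinstance(df.index, pd.DatetimeIndex) and isinstance(rule, str):
            # DataFrame.resample no longer accepts `how`; aggregate explicitly
            return df.resample(rule, **kwargs).agg(how)
        else:
            # integer index (assumed sorted): half-open bins of width `rule`;
            # the upper bound is extended so the last index value always falls in a bin
            edges = list(range(df.index[0], df.index[-1] + rule + 1, rule))
            idx, bins = pd.cut(df.index, edges, right=False, retbins=True)
            # observed=False keeps empty bins so the labels below line up
            aux = df.groupby(idx, observed=False).agg(how)
            # label each bin by its left edge
            aux.index = pd.Index(bins[:-1])
            return aux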
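
    To see what pd.cut is doing in the integer branch, here is a minimal sketch on a made-up 12-row frame with a bin width of 5. It also shows why the upper bound of the edges matters: if the last edge does not reach past the final index value, pd.cut assigns those rows no bin and groupby drops them:

```python
import pandas as pd

df = pd.DataFrame({"A": range(12)}, index=range(100, 112))
width = 5

# Edges that stop at 110: with right=False, rows 110 and 111 get no bin
short_edges = [100, 105, 110]
short = pd.cut(df.index, short_edges, right=False)
n_unbinned = int(short.isna().sum())  # 2

# Edges extended past the last index value: 100, 105, 110, 115
edges = list(range(df.index[0], df.index[-1] + width + 1, width))
bins = pd.cut(df.index, edges, right=False)

out = df.groupby(bins, observed=False)["A"].mean()
out.index = edges[:-1]  # label each bin by its left edge
print(out)
```

    The result has one row per bin, labelled 100, 105 and 110, with the last bin averaging only the two rows it contains.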
    