I'm looking for a pandas equivalent of the resample
method for a dataframe whose index isn't a DatetimeIndex
but an array of integers, or maybe even floats.
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])
You need to create the labels to group by yourself. I'd use:
(df.index.to_series() / 5).astype(int)
This gives a series of values like [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...],
which you then pass to groupby.
You'll also need to specify the index for the new dataframe. I'd use:
df.index[4::5]
This takes the current index starting at the 5th position (hence the 4)
and every 5th position after that, giving [4, 9, 14, 19].
I could have used df.index[::5] to label each group by its starting position,
but I went with ending positions.
# assign as variable because I'm going to use it more than once.
s = (df.index.to_series() / 5).astype(int)
df.groupby(s).std().set_index(s.index[4::5])
Looks like:
           A         B
4   0.198019  0.320451
9   0.329750  0.408232
14  0.293297  0.223991
19  0.095633  0.376390
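As a self-contained sketch of the same steps, here is the grouping done with floor division (`//`), which yields the same labels as dividing and casting to int for a non-negative integer index. The frame is unseeded random data, so the numbers will not match the output above; the names `labels` and `down` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same shape as the example frame; the values are random and only illustrative.
df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])

# One group label per row: 0 for rows 0-4, 1 for rows 5-9, and so on.
labels = df.index.to_series() // 5

# Any groupby aggregation works here; std() matches the example above.
down = df.groupby(labels).std()

# Relabel each group with the last original index value it covers.
down.index = df.index[4::5]
```

Swapping `std()` for `mean()`, `sum()` or `agg(...)` changes the statistic without touching the labelling.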
This is the equivalent of downsampling. We haven't addressed upsampling yet.
To go back from what we've produced to a dataframe indexed by something more frequent, we can use reindex
like so:
# assign what we've done above to df_down
df_down = df.groupby(s).std().set_index(s.index[4::5])
df_up = df_down.reindex(range(20)).bfill()
Looks like:
           A         B
0   0.198019  0.320451
1   0.198019  0.320451
2   0.198019  0.320451
3   0.198019  0.320451
4   0.198019  0.320451
5   0.329750  0.408232
6   0.329750  0.408232
7   0.329750  0.408232
8   0.329750  0.408232
9   0.329750  0.408232
10  0.293297  0.223991
11  0.293297  0.223991
12  0.293297  0.223991
13  0.293297  0.223991
14  0.293297  0.223991
15  0.095633  0.376390
16  0.095633  0.376390
17  0.095633  0.376390
18  0.095633  0.376390
19  0.095633  0.376390
We could also reindex by something else, like range(0, 20, 2),
to upsample to even integer indices.
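One thing to watch with a target like range(0, 20, 2): reindex only keeps rows whose labels appear in the new index, so the rows labelled 9 and 19 would be dropped before the fill. The sketch below uses a small made-up frame in place of df_down (the values 1.0 to 4.0 are placeholders) and shows one possible workaround, reindexing onto the union of old and new labels first.

```python
import pandas as pd

# Hypothetical stand-in for df_down: one row per block, labelled by block end.
df_down = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]}, index=[4, 9, 14, 19])

# Only labels 4 and 14 appear in the even range, so rows 9 and 19 are
# discarded by reindex before bfill() runs; labels 16 and 18 stay NaN.
even = df_down.reindex(range(0, 20, 2)).bfill()

# Reindexing onto the union of old and new labels keeps every row available
# for the fill; selecting the even labels afterwards gives the target grid.
target = pd.Index(range(0, 20, 2))
filled = df_down.reindex(df_down.index.union(target)).bfill().loc[target]
```

Whether the direct reindex or the union version is the right one depends on whether the odd-labelled rows should influence the result.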
@piSquared's solution is really nice, but I don't like picking the index by hand when reindexing.
This should also work for any kind of downsampling (float indexes too), and it automatically uses the mean of the index values in each range as the new label:
df = pd.DataFrame(index = np.random.rand(20)*30, data=np.random.rand(20, 2), columns=['A', 'B'])
df.index.name = 'crazy_index'
s = (df.index.to_series() / 10).astype(int)
Now you can pick whichever function you want to calculate in each subgroup:
# calculate std() in each group
df.groupby(s).std().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )
                    A         B
crazy_index
3.667539     0.276986  0.317642
14.275074    0.248700  0.372551
25.054042    0.254860  0.297586
# calculate median() in each group
df.groupby(s).median().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )
                    A         B
crazy_index
3.667539     0.454654  0.521649
14.275074    0.451265  0.490125
25.054042    0.489326  0.622781
EDIT : There were some errors in s indexing, now it is correct & working.
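The approach above could be wrapped in a small helper, sketched here under a few assumptions: `downsample` and its parameters are hypothetical names, the index values are unique, and `np.floor` replaces `astype(int)` on the raw quotient so that negative index values get their own bins instead of being truncated toward zero.

```python
import numpy as np
import pandas as pd

def downsample(df, width, how="mean"):
    """Hypothetical helper: aggregate rows into bins of `width` index units."""
    pos = df.index.to_series()
    # floor() rather than a plain int cast, so -3 / 10 lands in bin -1, not 0.
    labels = np.floor(pos / width).astype(int)
    out = df.groupby(labels).agg(how)
    # Label each bin with the mean of the index values that fell into it.
    centers = pos.groupby(labels).mean().to_numpy()
    out.index = pd.Index(centers, name=df.index.name)
    return out

# Small made-up frame with a float index.
df = pd.DataFrame({'A': [1.0, 3.0, 10.0, 20.0]}, index=[1.0, 3.0, 12.0, 16.0])
res = downsample(df, 10)
```

Here the rows at 1.0 and 3.0 share one bin and the rows at 12.0 and 16.0 share another, so `res` has two rows labelled 2.0 and 14.0.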
Alternatively, here is one thing that can be done:
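import pandas as pd

def resample(df, rule, how=None, **kwargs):
    # Default to the mean when no aggregation is given.
    if how is None:
        how = "mean"
    if isinstance(df.index, pd.DatetimeIndex) and isinstance(rule, str):
        # Datetime index: defer to pandas' own resample.
        return df.resample(rule, **kwargs).agg(how)
    # Sorted integer index: cut it into half-open bins [edge, edge + rule).
    # The stop value is chosen so the last edge lies past the final index value.
    edges = range(df.index[0], df.index[-1] + rule + 1, rule)
    idx, bins = pd.cut(df.index, edges, right=False, retbins=True)
    aux = df.groupby(idx, observed=False).agg(how)
    # Label each bin by its left edge.
    return aux.set_index(bins[:-1])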
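To see what the pd.cut call in the integer branch does on its own, here is a minimal sketch on a made-up frame with index 0..19 and a bin width of 5. The column values are placeholders chosen so the group means are easy to check by hand.

```python
import pandas as pd

idx = pd.RangeIndex(20)
df = pd.DataFrame({'A': range(20)}, index=idx)

# Edges 0, 5, 10, 15, 20; right=False makes each bin half-open: [0, 5), [5, 10), ...
cats, bins = pd.cut(idx, range(0, 21, 5), right=False, retbins=True)

# Integer code of the bin each index value landed in.
codes = list(cats.codes)

# Group by the bins, then label each group by its left edge.
out = df.groupby(cats, observed=False).mean()
out.index = bins[:-1]
```

Because the bins are closed on the left, the value 5 falls into the second bin, so the groups line up with the [0, 0, 0, 0, 0, 1, ...] labels from the first answer, only labelled by starting position instead of ending position.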