Question
Consider the following dataframe:
index count signal
1 1 1
2 1 NAN
3 1 NAN
4 1 -1
5 1 NAN
6 2 NAN
7 2 -1
8 2 NAN
9 3 NAN
10 3 NAN
11 3 NAN
12 4 1
13 4 NAN
14 4 NAN
I need to forward-fill ('ffill') the NaNs in 'signal', but rows with different 'count' values should not affect each other, so that I get the following dataframe:
index count signal
1 1 1
2 1 1
3 1 1
4 1 -1
5 1 -1
6 2 NAN
7 2 -1
8 2 -1
9 3 NAN
10 3 NAN
11 3 NAN
12 4 1
13 4 1
14 4 1
Right now I iterate over each group in the groupby object, fill the NaN values, and then copy the result into a new data frame:
new_table = np.array([])
for key, group in df.groupby('count'):
    group['signal'] = group['signal'].fillna(method='ffill')
    group1 = group.copy()
    if new_table.shape[0] == 0:
        new_table = group1
    else:
        new_table = pd.concat([new_table, group1])
This kind of works, but it is really slow because the data frame is large. Is there another way to do it, with or without groupby? Thanks!
EDITED:
Thanks to Alexander and jwilner for providing alternative methods. However, both methods are very slow for my big dataframe, which has 800,000 rows of data.
Answer 1:
Use the apply method.
In [56]: df = pd.DataFrame({"count": [1] * 4 + [2] * 5 + [3] * 2 , "signal": [1] + [None] * 4 + [-1] + [None] * 5})
In [57]: df
Out[57]:
count signal
0 1 1
1 1 NaN
2 1 NaN
3 1 NaN
4 2 NaN
5 2 -1
6 2 NaN
7 2 NaN
8 2 NaN
9 3 NaN
10 3 NaN
[11 rows x 2 columns]
In [58]: def ffill_signal(df):
   ....:     df["signal"] = df["signal"].ffill()
   ....:     return df
   ....:
In [59]: df.groupby("count").apply(ffill_signal)
Out[59]:
count signal
0 1 1
1 1 1
2 1 1
3 1 1
4 2 NaN
5 2 -1
6 2 -1
7 2 -1
8 2 -1
9 3 NaN
10 3 NaN
[11 rows x 2 columns]
However, be aware that groupby collects all rows with the same key together, regardless of where they sit in the frame. If the count column doesn't always stay the same or increase, but instead can have values repeated later on, groupby might be problematic. That is, given a count series like [1, 1, 2, 2, 1], groupby will group it like so: [1, 1, 1], [2, 2], which could have undesirable effects on your forward filling. If that is not what you want, you'd have to create a new series to use with groupby that stays the same within a run and increases whenever the count series changes, probably using pd.Series.diff and pd.Series.cumsum.
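As a rough sketch of that idea (the name block and the toy data are made up for illustration, not taken from the question), a run identifier can be built by marking every change in count and taking a cumulative sum:

```python
import numpy as np
import pandas as pd

# Toy data (assumed for illustration): the value 1 reappears after the 2s.
df = pd.DataFrame({
    "count": [1, 1, 2, 2, 1],
    "signal": [1.0, np.nan, -1.0, np.nan, np.nan],
})

# diff() is NaN on the first row and non-zero wherever count changes;
# ne(0) flags both cases as the start of a new run, and cumsum() turns
# those flags into a run id that never decreases.
block = df["count"].diff().ne(0).cumsum().rename("block")

# Forward-fill within each consecutive run instead of each distinct count value.
df["signal"] = df.groupby(block)["signal"].ffill()
print(df)
```

With this grouping the last row (a new run of count == 1) stays NaN instead of inheriting the value from the first run.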
Answer 2:
An alternative solution is to create a pivot table, forward fill values, and then map them back into the original DataFrame.
df2 = df.pivot(columns='count', values='signal', index='index').ffill()
df['signal'] = [df2.at[i, c]
                for i, c in zip(df2.index, df['count'].tolist())]
>>> df
count index signal
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 -1
4 1 5 -1
5 2 6 NaN
6 2 7 -1
7 2 8 -1
8 3 9 NaN
9 3 10 NaN
10 3 11 NaN
11 4 12 1
12 4 13 1
13 4 14 1
With 800k rows of data, the efficiency of this approach depends on how many unique values are in 'count'.
Compared to my prior answer:
%%timeit
for c in df['count'].unique():
    df.loc[df['count'] == c, 'signal'] = df[df['count'] == c].ffill()
100 loops, best of 3: 4.1 ms per loop
%%timeit
df2 = df.pivot(columns='count', values='signal', index='index').ffill()
df['signal'] = [df2.at[i, c] for i, c in zip(df2.index, df['count'].tolist())]
1000 loops, best of 3: 1.32 ms per loop
Lastly, you can simply use groupby, although it is slower than the previous method:
df.groupby('count').ffill()
Out[191]:
index signal
0 1 1
1 2 1
2 3 1
3 4 -1
4 5 -1
5 6 NaN
6 7 -1
7 8 -1
8 9 NaN
9 10 NaN
10 11 NaN
11 12 1
12 13 1
13 14 1
%%timeit
df.groupby('count').ffill()
100 loops, best of 3: 3.55 ms per loop
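Since the groupby result above no longer carries the 'count' column, one way to write the filled values back into the original frame is to select 'signal' on the groupby object and assign the result. This is only a sketch, with the sample data retyped from the question; it is not a benchmark:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample data retyped from the question.
df = pd.DataFrame({
    "index": range(1, 15),
    "count": [1] * 5 + [2] * 3 + [3] * 3 + [4] * 3,
    "signal": [1, nan, nan, -1, nan, nan, -1, nan, nan, nan, nan, 1, nan, nan],
})

# The per-group forward fill returns a Series aligned with df's own index,
# so it can be assigned straight back without any loop or concat.
df["signal"] = df.groupby("count")["signal"].ffill()
print(df)
```

Leading NaNs in a group (rows 6 and 9 to 11 here) stay NaN because there is nothing earlier in the same group to fill from.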
Answer 3:
I know it's very late, but I found a solution that is much faster than those proposed, namely to collect the updated dataframes in a list and do the concatenation only at the end. To take your example:
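new_table = []
for key, group in df.groupby('count'):
    # Work on a copy of the group rather than the slice handed out by groupby.
    group1 = group.copy()
    group1['signal'] = group1['signal'].ffill()
    # A list has no .shape, so just append; no emptiness check is needed.
    new_table.append(group1)
# Concatenate once at the end instead of once per group.
new_table = pd.concat(new_table).reset_index(drop=True)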
Answer 4:
Assuming the data has been pre-sorted on df['index'], try using loc instead:
for c in df['count'].unique():
    df.loc[df['count'] == c, 'signal'] = df[df['count'] == c].ffill()
>>> df
index count signal
0 1 1 1
1 2 1 1
2 3 1 1
3 4 1 -1
4 5 1 -1
5 6 2 NaN
6 7 2 -1
7 8 2 -1
8 9 3 NaN
9 10 3 NaN
10 11 3 NaN
11 12 4 1
12 13 4 1
13 14 4 1
Source: https://stackoverflow.com/questions/30290377/edit-dataframe-entries-using-groupby-object-pandas