Question
Edit: I made a rookie mistake by storing the string 'np.nan' instead of a real np.nan, as pointed out by @coldspeed, @wen-ben, and @ALollz. The answers are quite good, so I am not deleting this question in order to keep them.
Original:
I have read the question and answer "What's the difference between groupby.first() and groupby.head(1)?" That answer explains that the difference lies in how NaN values are handled. However, when I call groupby with as_index=False, both of them pick up NaN just fine.
Furthermore, pandas has groupby.nth, with functionality similar to head and first. What are the differences between groupby.first(), groupby.nth(0), and groupby.head(1) with as_index=False?
Example below:
In [448]: df
Out[448]:
   A       B
0  1  np.nan
1  1       4
2  1      14
3  2       8
4  2      19
5  2      12
In [449]: df.groupby('A', as_index=False).head(1)
Out[449]:
   A       B
0  1  np.nan
3  2       8
In [450]: df.groupby('A', as_index=False).first()
Out[450]:
   A       B
0  1  np.nan
1  2       8
In [451]: df.groupby('A', as_index=False).nth(0)
Out[451]:
   A       B
0  1  np.nan
3  2       8
I saw that first() resets the index while the other two don't. Besides that, are there any differences?
Answer 1:
The major issue is that you likely have the string 'np.nan' stored and not a real null value. Here is how the three methods handle null values differently:
Sample Data:
import numpy as np
import pandas as pd

# Column B mixes None, strings, a real NaN, and integers (object dtype).
df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'B': [None, '1', np.nan, '2', 3, 4]})
first
This returns the first non-null value within each group. Oddly enough, it does not skip None, though that can be enabled with the kwarg dropna=True. As a result, the returned row may combine values from columns that originally belonged to different rows:
df.groupby('A', as_index=False).first()
#   A     B
#0  1  None
#1  2     2
#2  3     3
df.groupby('A', as_index=False).first(dropna=True)
#   A  B
#0  1  1
#1  2  2
#2  3  3
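The point about values coming from different rows is easier to see with a second value column. Here is a minimal sketch, assuming a hypothetical frame `demo` with columns B and C and real np.nan values (this is not the sample data above):

```python
import numpy as np
import pandas as pd

# Hypothetical data: each group has a NaN in a different column.
demo = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": [np.nan, 10.0, 20.0, 30.0],
    "C": [100.0, 200.0, np.nan, 400.0],
})

# first() takes the first non-null value of every column independently,
# so a single output row can be stitched together from several input rows.
mixed = demo.groupby("A", as_index=False).first()
print(mixed)
# Group 1: B comes from row 1, C comes from row 0.
# Group 2: B comes from row 2, C comes from row 3.
```

So the output row for group 1 never existed as a row in `demo`, which is why keeping an original row label on it would make little sense.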
head(n)
Returns the top n rows within each group. Values remain bound within rows. If you give it an n larger than the number of rows in a group, it returns all rows in that group without complaining:
df.groupby('A', as_index=False).head(1)
#   A     B
#0  1  None
#2  2   NaN
#4  3     3
df.groupby('A', as_index=False).head(200)
#   A     B
#0  1  None
#1  1     1
#2  2   NaN
#3  2     2
#4  3     3
#5  3     4
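To check both properties of head(n) programmatically, here is a small sketch on a hypothetical numeric frame `demo` (again not the sample data above). It looks at the row labels that survive and at what happens when n is larger than every group:

```python
import numpy as np
import pandas as pd

# Hypothetical data, numeric only, with a real NaN at the top of group 1.
demo = pd.DataFrame({
    "A": [1, 1, 2, 2, 3, 3],
    "B": [np.nan, 1.0, 2.0, 3.0, 4.0, 5.0],
})

grouped = demo.groupby("A", as_index=False)

top1 = grouped.head(1)        # first row of each group, NaN included
top_all = grouped.head(200)   # n exceeds every group size

print(top1.index.tolist())    # the original row labels are kept
print(top_all.equals(demo))   # every row comes back, in the original order
```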
nth
This takes the nth row, so again values remain bound within the row. .nth(0) is the same as .head(1), though they have different uses. For instance, if you need the 0th and 2nd rows, that is difficult to do with .head() but easy with .nth([0, 2]). Conversely, it is far easier to write .head(10) than .nth(list(range(10))).
df.groupby('A', as_index=False).nth(0)
#   A     B
#0  1  None
#2  2   NaN
#4  3     3
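The .nth([0, 2]) case needs groups with at least three rows, which the sample data does not have. A quick sketch on a hypothetical frame `demo` with two groups of three rows:

```python
import pandas as pd

# Hypothetical data: two groups of three rows each.
demo = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2],
    "B": [10, 11, 12, 20, 21, 22],
})

grouped = demo.groupby("A", as_index=False)

# Rows at positions 0 and 2 within each group.
picked = grouped.nth([0, 2])
print(picked["B"].tolist())

# With as_index=False, nth(0) and head(1) select the same rows.
same = grouped.nth(0).equals(grouped.head(1))
print(same)
```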
nth also supports dropping rows that contain any null values, so unlike .head() you can use it to return the first row without any null values:
df.groupby('A', as_index=False).nth(0, dropna='any')
#   A  B
#A
#1  1  1
#2  2  2
#3  3  3
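If you would rather stick with .head(), dropping the null rows up front should pick the same rows. A sketch on a hypothetical numeric frame `demo`; note that the shape of the index returned by nth can differ between pandas versions, so only the B values are compared here:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with real NaN values at the top of groups 1 and 2.
demo = pd.DataFrame({
    "A": [1, 1, 2, 2, 3, 3],
    "B": [np.nan, 1.0, np.nan, 2.0, 3.0, 4.0],
})

# First row per group after discarding rows that contain any null.
via_nth = demo.groupby("A", as_index=False).nth(0, dropna="any")

# Alternative spelling: drop the null rows first, then take head(1).
via_head = demo.dropna(how="any").groupby("A", as_index=False).head(1)

print(via_nth["B"].tolist())
print(via_head["B"].tolist())
```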
Answer 2:
Here is the difference. You need to convert the string 'np.nan' to a real NaN; in your original df it is a string. After converting it, you will see the difference:
df = df.mask(df == 'np.nan')
df.groupby('A', as_index=False).head(1)  # same as df.groupby('A', as_index=False).nth(0)
Out[8]:
   A    B
0  1  NaN
3  2    8
df.groupby('A', as_index=False).first()
# first() resets the index because it may select values from different rows
# within the group: when the first item is NaN, it skips it and takes the
# first non-null value rather than staying on the same row.
# Keeping the original row index would therefore be misleading.
Out[9]:
   A  B
0  1  4
1  2  8
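For a self-contained run of the whole comparison, here is a sketch that rebuilds the question's data. It swaps mask() for pd.to_numeric(..., errors="coerce"), which is a different technique from the one above: it turns anything non-numeric into a real NaN and leaves column B as float. The names `fixed`, `by_head`, `by_nth` and `by_first` are just for illustration.

```python
import pandas as pd

# The question's data, with the string 'np.nan' stored by mistake.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2],
    "B": ["np.nan", 4, 14, 8, 19, 12],
})
print(df["B"].isna().sum())      # the string does not count as a null value

# Alternative to mask(): coerce anything non-numeric to a real NaN.
fixed = df.assign(B=pd.to_numeric(df["B"], errors="coerce"))
print(fixed["B"].isna().sum())   # now there is one real NaN

grouped = fixed.groupby("A", as_index=False)
by_head = grouped.head(1)    # keeps the first row of each group, NaN included
by_nth = grouped.nth(0)      # selects the same rows as head(1)
by_first = grouped.first()   # skips the NaN and returns a fresh 0..n-1 index

print(by_head)
print(by_nth)
print(by_first)
```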
Source: https://stackoverflow.com/questions/55583246/what-is-different-between-groupby-first-groupby-nth-groupby-head-when-as-index