问题
Trying to remove duplicate based on unique values on column 'new', I have even tried two methods, but the output df.shape suggests before/after have the same df shape, meaning remove duplication fails.
import pandas
import numpy as np
import random
df = pandas.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df['new'] = [1, 1, 3, 4, 5, 1, 7, 8, 1, 10]
df['new2'] = [1, 1, 2, 4, 5, 3, 7, 8, 9, 5]
print df.shape
df.drop_duplicates('new', take_last=False)
df.groupby('new').max()
print df.shape
# output
(10, 6)
(10, 6)
[Finished in 1.0s]
回答1:
You need to assign the result of drop_duplicates, by default inplace=False
so it returns a copy of the modified df, as you don't pass param inplace=True
your original df is unmodified:
In [106]:
df = df.drop_duplicates('new', take_last=False)
df.groupby('new').max()
Out[106]:
A B C D new2
new
1 -1.698741 -0.550839 -0.073692 0.618410 1
3 0.519596 1.686003 1.395585 1.298783 2
4 1.557550 1.249577 0.214546 -0.077569 4
5 -0.183454 -0.789351 -0.374092 -1.824240 5
7 -1.176468 0.546904 0.666383 -0.315945 7
8 -1.224640 -0.650131 -0.394125 0.765916 8
10 -1.045131 0.726485 -0.194906 -0.558927 5
if you passed inplace=True
it would work:
In [108]:
df.drop_duplicates('new', take_last=False, inplace=True)
df.groupby('new').max()
Out[108]:
A B C D new2
new
1 0.334352 -0.355528 0.098418 -0.464126 1
3 -0.394350 0.662889 -1.012554 -0.004122 2
4 -0.288626 0.839906 1.335405 0.701339 4
5 0.973462 -0.818985 1.020348 -0.306149 5
7 -0.710495 0.580081 0.251572 -0.855066 7
8 -1.524862 -0.323492 -0.292751 1.395512 8
10 -1.164393 0.455825 -0.483537 1.357744 5
来源:https://stackoverflow.com/questions/32334695/remove-duplicate-method-for-python-pandas-doesnt-work