问题
I tried to groupby a df and then select groups who do not have the same value on a specific column and whose group size > 1,
df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
the df looks like, note that some account_no strings only have a single space, ext_id and int_id are also strings, amount is float;
account_no ext_id amount int_id
2665057 439.504062 D000192
2665057 439.504062 D000192
353724 2758.92 952
353724 2758.92 952
the code supposed to return an empty df, since none of the rows in the sample satisfy the conditions here, but the rows with int_id = 292 remained, so how to fix the issue here.
ps. numpy 1.14.3, pandas 0.22.0, python 3.5.2
回答1:
In my opinion there is problem some traling whitespace or similar.
You can check it:
df = pd.DataFrame({'account_no': ['a', 'a', 'a', 'a'],
'ext_id': [2665057, 2665057, 353724, 353724],
'amount': [439.50406200000003, 439.50406200000003, 2758.92, 2758.92],
'int_id': ['D000192', 'D000192', ' 952', '952']})
print (df)
account_no amount ext_id int_id
0 a 439.504062 2665057 D000192
1 a 439.504062 2665057 D000192
2 a 2758.920000 353724 952
3 a 2758.920000 353724 952
df1 = df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
print (df1)
account_no amount ext_id int_id
2 a 2758.92 353724 952
3 a 2758.92 353724 952
print (df1['int_id'].tolist())
[' 952', '952']
And then remove it by str.strip:
df['int_id'] = df['int_id'].str.strip()
df1 = df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
print (df1)
Empty DataFrame
Columns: [account_no, amount, ext_id, int_id]
Index: []
来源:https://stackoverflow.com/questions/50293962/numpy-unique-could-not-filter-out-groups-with-the-same-value-on-a-specific-colum