Question
I am trying to find out whether there is a way to remove duplicates in my DataFrame while concatenating the values.
Example:
df
key v1 v2
0 1 n/a a
1 2 n/a b
2 3 n/a c
3 2 n/a d
4 3 n/a e
The output should look like:
df_out
key v1 v2
0 1 n/a a
1 2 n/a b,d
2 3 n/a c,e
I tried df.drop_duplicates() and a loop that saves the v2 column values, but nothing has worked yet. I would like to do this cleanly in pandas, without a loop.
Does anyone know a way pandas can do this?
Answer 1:
This should be easy, assuming you have just the two value columns. Use groupby + agg: v1 should be aggregated by first, and v2 should be aggregated by ','.join.
df
key v1 v2
0 1 NaN a
1 2 NaN b
2 3 NaN c
3 2 NaN d
4 3 NaN e
(df.groupby('key')
.agg({'v1' : 'first', 'v2' : ','.join})
.reset_index()
.reindex(columns=df.columns))
key v1 v2
0 1 NaN a
1 2 NaN b,d
2 3 NaN c,e
If you have multiple such columns requiring the same aggregation, build an agg dict called f and pass it to agg.
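One way to build such a dict is sketched below. The extra column v3 and the list name join_cols are hypothetical, added only for illustration, and the joined columns are assumed to hold strings.

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question, plus a hypothetical extra column v3.
df = pd.DataFrame({
    "key": [1, 2, 3, 2, 3],
    "v1": [np.nan] * 5,
    "v2": list("abcde"),
    "v3": list("vwxyz"),
})

# Columns whose values should be comma-joined (assumed to contain strings).
join_cols = ["v2", "v3"]

# Default every non-key column to 'first', then override the joined ones.
f = {col: "first" for col in df.columns if col != "key"}
f.update({col: ",".join for col in join_cols})

out = (df.groupby("key")
         .agg(f)
         .reset_index()
         .reindex(columns=df.columns))
print(out)
```

Building f from df.columns means that adding another column to the frame only requires listing it in join_cols if it should be joined.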
Answer 2:
Using set (this also drops values repeated within a group, but a set does not guarantee their order):
df.groupby('key').agg(lambda x : ','.join(set(x)))
Out[1255]:
v1 v2
key
1 n/a a
2 n/a b,d
3 n/a c,e
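If the order of the joined values matters, a possible variant is to deduplicate with dict.fromkeys instead of set, since a dict keeps insertion order. This is only a sketch: the helper name join_unique is made up here, the sample adds one repeated row to show the deduplication, and every column is assumed to hold strings (v1 is the literal string "n/a", as in the output above).

```python
import pandas as pd

# Sample frame with a repeated (key=2, v2="b") row to show deduplication.
df = pd.DataFrame({
    "key": [1, 2, 3, 2, 3, 2],
    "v1": ["n/a"] * 6,
    "v2": ["a", "b", "c", "d", "e", "b"],
})

def join_unique(values):
    # dict.fromkeys keeps the first occurrence of each value, in order.
    return ",".join(dict.fromkeys(values))

out = df.groupby("key").agg(join_unique)
print(out)
```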
Answer 3:
Use apply. From the pandas documentation for pandas.core.groupby.GroupBy.apply:
GroupBy.apply(func, *args, **kwargs)
Apply function func group-wise and combine the results together.
df.groupby(["key", "v1"])["v2"].apply(list) # or apply(set) depending on your needs
Output:
key v1
1 n/a [a]
2 n/a [b, d]
3 n/a [c, e]
Name: v2, dtype: object
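The result above holds Python lists. To get the comma-separated strings the question asks for, the same grouping can apply ','.join instead. The sketch below assumes v1 holds real NaN values rather than the string "n/a"; in that case groupby drops those rows by default, so it passes dropna=False, which needs pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": [1, 2, 3, 2, 3],
    "v1": [np.nan] * 5,
    "v2": list("abcde"),
})

# dropna=False keeps groups whose key contains NaN (pandas 1.1 or later).
out = (df.groupby(["key", "v1"], dropna=False)["v2"]
         .apply(",".join)
         .reset_index())
print(out)
```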
Source: https://stackoverflow.com/questions/47980402/comma-separated-values-from-pandas-groupby