Drop duplicate list elements in column of lists


Question


This is my dataframe:

pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
              'B':[0, 2, 3, 4, 5, 6, 7],
              'C':[[1,4,4,4], [1,4,4,4], [3,4,4,5], [3,4,4,5], [4,4,2,1], [1,2,3,4,], [7,8,9,1]]})

I want to drop the duplicate values from each list in column C (per row), but not drop duplicate rows.

This is what I hope to get:

pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
              'B':[0, 2, 3, 4, 5, 6, 7],
              'C':[[1,4], [1,4], [3,4,5], [3,4,5], [4,2,1], [1,2,3,4,], [7,8,9,1]]})

Answer 1:


If you're using Python 3.7+, you could map with dict.fromkeys and build a list from the dictionary keys (the version is relevant since dict insertion order is guaranteed starting from 3.7):

df['C'] = df.C.map(lambda x: list(dict.fromkeys(x).keys()))
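
Since iterating a dict yields its keys directly, the .keys() call can be dropped; an equivalent, slightly shorter form:

df['C'] = df.C.map(lambda x: list(dict.fromkeys(x)))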

For older Python versions, there is collections.OrderedDict:

from collections import OrderedDict
df['C'] = df.C.map(lambda x: list(OrderedDict.fromkeys(x).keys()))

print(df)

   A  B             C
0  1  0        [1, 4]
1  3  2        [1, 4]
2  3  3     [3, 4, 5]
3  4  4     [3, 4, 5]
4  5  5     [4, 2, 1]
5  3  6  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]

As mentioned by cs95 in the comments, if we don't need to preserve order, we could go with a set for a more concise approach:

df['C'] = df.C.map(lambda x: [*{*x}])

Since several approaches have been proposed and it is hard to tell how they will perform on large dataframes, it is probably worth benchmarking:

import numpy as np
import perfplot

df = pd.concat([df]*50000, axis=0).reset_index(drop=True)

perfplot.show(
    setup=lambda n: df.iloc[:int(n)], 

    kernels=[
        lambda df: df.C.map(lambda x: list(dict.fromkeys(x).keys())),
        lambda df: df['C'].map(lambda x: pd.factorize(x)[1]),
        lambda df: [np.unique(item) for item in df['C'].values],
        lambda df: df['C'].explode().groupby(level=0).unique(),
        lambda df: df.C.map(lambda x: [*{*x}]),
    ],

    labels=['dict.fromkeys', 'factorize', 'np.unique', 'explode', 'set'],
    n_range=[2**k for k in range(0, 18)],
    xlabel='N',
    equality_check=None
)




Answer 2:


If order is of no importance, you could iterate over the column's values and apply np.unique to each row's list in a list comprehension.

import numpy as np
df['C_Unique'] = [np.unique(item) for item in df['C'].values]

print(df)

   A  B             C      C_Unique
0  1  0  [1, 4, 4, 4]        [1, 4]
1  3  2  [1, 4, 4, 4]        [1, 4]
2  3  3  [3, 4, 4, 5]     [3, 4, 5]
3  4  4  [3, 4, 4, 5]     [3, 4, 5]
4  5  5  [4, 4, 2, 1]     [1, 2, 4]
5  3  6  [1, 2, 3, 4]  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]  [1, 7, 8, 9]
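
Note that np.unique returns its result sorted, which is why rows 4 and 6 above come back as [1, 2, 4] and [1, 7, 8, 9]. If first-occurrence order matters, a small sketch using return_index (the helper name and the C_Unique_Ordered column are just illustrative):

import numpy as np

def unique_keep_order(item):
    # np.unique gives sorted unique values plus the index of each value's
    # first occurrence; re-sorting by those indices restores original order
    vals, first_idx = np.unique(item, return_index=True)
    return list(vals[np.argsort(first_idx)])

df['C_Unique_Ordered'] = [unique_keep_order(item) for item in df['C'].values]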

Another method would be to use explode (pandas 0.25+) and groupby.unique:

df['CExplode'] = df['C'].explode().groupby(level=0).unique()

   A  B             C      C_Unique      CExplode
0  1  0  [1, 4, 4, 4]        [1, 4]        [1, 4]
1  3  2  [1, 4, 4, 4]        [1, 4]        [1, 4]
2  3  3  [3, 4, 4, 5]     [3, 4, 5]     [3, 4, 5]
3  4  4  [3, 4, 4, 5]     [3, 4, 5]     [3, 4, 5]
4  5  5  [4, 4, 2, 1]     [1, 2, 4]     [4, 2, 1]
5  3  6  [1, 2, 3, 4]  [1, 2, 3, 4]  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]  [1, 7, 8, 9]  [7, 8, 9, 1]
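
groupby(level=0) groups the exploded values back by their original row label, so this relies on the index labels being unique. If you would rather end up with plain Python lists instead of arrays, a minimal variation:

# same idea, coercing each resulting array back to a plain Python list
df['CExplode'] = df['C'].explode().groupby(level=0).unique().map(list)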



Answer 3:


You can use the apply function in pandas:

df['C'] = df['C'].apply(lambda x: list(set(x)))
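
Keep in mind that set does not preserve the original element order, so this is only appropriate when order does not matter. A quick illustration (the exact order a set produces is implementation-dependent):

row = [4, 4, 2, 1]
print(list(set(row)))            # e.g. [1, 2, 4] -- order not guaranteed
print(list(dict.fromkeys(row)))  # [4, 2, 1] -- keeps first-seen order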



Answer 4:


map and factorize

Let's throw one more into the mix.

df['C'].map(pd.factorize).str[1]

0          [1, 4]
1          [1, 4]
2       [3, 4, 5]
3       [3, 4, 5]
4       [4, 2, 1]
5    [1, 2, 3, 4]
6    [7, 8, 9, 1]
Name: C, dtype: object

Or,

df['C'].map(lambda x: pd.factorize(x)[1])

0          [1, 4]
1          [1, 4]
2       [3, 4, 5]
3       [3, 4, 5]
4       [4, 2, 1]
5    [1, 2, 3, 4]
6    [7, 8, 9, 1]
Name: C, dtype: object
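
pd.factorize returns a (codes, uniques) pair, which is why both variants pick element 1 of the result; the uniques come back in first-seen order. A quick look at a single row:

import pandas as pd

codes, uniques = pd.factorize([1, 4, 4, 4])
print(codes)    # [0 1 1 1] -> integer label per element
print(uniques)  # [1 4]     -> unique values in first-seen order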


Source: https://stackoverflow.com/questions/62872266/drop-duplicate-list-elements-in-column-of-lists
