pandas get unique values from column of lists

Submitted by 一曲冷凌霜 on 2021-02-07 12:37:41

Question


How do I get the unique values of a column of lists in pandas or numpy, such that the second column from

would result in 'action', 'crime', 'drama'.

The closest (but non-functional) solutions I could come up with were:

 genres = data['Genre'].unique()

But this predictably results in a TypeError, because lists aren't hashable:

TypeError: unhashable type: 'list'

Using set seemed like a good idea, but

genres = data.apply(set(), columns=['Genre'], axis=1)

also results in a TypeError: set() takes no keyword arguments


Answer 1:


If you only want to find the unique values, I'd recommend using itertools.chain.from_iterable to concatenate all those lists:

>>> import itertools
>>> import numpy as np

>>> np.unique([*itertools.chain.from_iterable(df.Genre)])
array(['action', 'crime', 'drama'], dtype='<U6')

Or, even faster:

>>> set(itertools.chain.from_iterable(df.Genre))
{'action', 'crime', 'drama'}
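
Note that the two results differ in kind: np.unique returns a sorted NumPy array, while a set has no defined order. If a sorted plain list is wanted, the set can be passed through sorted(). A minimal sketch, assuming a small two-row frame like the one used in this thread (the variable name unique_genres is only illustrative):

```python
import itertools

import pandas as pd

# Assumed sample data, mirroring the frame used elsewhere in this thread.
df = pd.DataFrame({'Genre': [['crime', 'drama'], ['action', 'crime', 'drama']]})

# Flatten the lists lazily, deduplicate with a set, then sort for a stable order.
unique_genres = sorted(set(itertools.chain.from_iterable(df['Genre'])))
print(unique_genres)  # ['action', 'crime', 'drama']
```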

Timings

df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})
df = pd.concat([df]*10000)

%timeit set(itertools.chain.from_iterable(df.Genre))
100 loops, best of 3: 2.55 ms per loop
    
%timeit set([x for y in df['Genre'] for x in y])
100 loops, best of 3: 4.09 ms per loop

%timeit np.unique([*itertools.chain.from_iterable(df.Genre)])
100 loops, best of 3: 12.8 ms per loop

%timeit np.unique(df['Genre'].sum())
1 loop, best of 3: 1.65 s per loop

%timeit set(df['Genre'].sum())
1 loop, best of 3: 1.66 s per loop
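
A likely explanation for the gap, offered as an assumption rather than something measured above: summing a column of lists concatenates them pairwise, copying the growing result at every step, whereas chain.from_iterable walks each list once. The %timeit lines only work in IPython, so here is a rough sketch of the same comparison with the standard timeit module. The frame is deliberately smaller than the one above to keep the run short, the helper names via_chain and via_sum are made up for illustration, and the numbers will vary by machine:

```python
import itertools
import timeit

import pandas as pd

# Smaller than the 20000-row frame above so the slow variant finishes quickly.
df = pd.DataFrame({'Genre': [['crime', 'drama'], ['action', 'crime', 'drama']]})
df = pd.concat([df] * 1000, ignore_index=True)


def via_chain():
    # Single pass over every inner list, no intermediate flat list.
    return set(itertools.chain.from_iterable(df['Genre']))


def via_sum():
    # Series.sum() on lists repeatedly concatenates, building ever larger lists.
    return set(df['Genre'].sum())


chain_time = timeit.timeit(via_chain, number=5)
sum_time = timeit.timeit(via_sum, number=5)
print(f"chain: {chain_time:.4f} s, sum: {sum_time:.4f} s")
```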



Answer 2:


You can use explode:

data = pd.DataFrame([
    {
        "title": "The Godfather: Part II",
        "genres": ["crime", "drama"],
        "director": "Francis Ford Coppola"
    },
    {
        "title": "The Dark Knight",
        "genres": ["action", "crime", "drama"],
        "director": "Christopher Nolan"
    }
])
# Changed from data.explode("genres")["genres"].unique() as suggested by rafaelc
data["genres"].explode().unique() 

Results in:

array(['crime', 'drama', 'action'], dtype=object)
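
One thing to watch for with explode: rows holding an empty list become NaN, and missing entries pass through unchanged, so both can show up in the unique values. A small sketch, assuming a frame with such rows (the third and fourth titles are placeholders, not real data), that drops them before deduplicating:

```python
import pandas as pd

# Hypothetical data: one row has an empty list and one has no genres at all.
data = pd.DataFrame({
    "title": ["The Godfather: Part II", "The Dark Knight", "Untitled A", "Untitled B"],
    "genres": [["crime", "drama"], ["action", "crime", "drama"], [], None],
})

# explode() gives one row per list element; dropna() removes the NaN/None
# left behind by the empty list and the missing entry.
unique_genres = data["genres"].explode().dropna().unique()
print(list(unique_genres))  # ['crime', 'drama', 'action']
```

Values come back in order of first appearance, not sorted.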



Answer 3:


Here are some options:

# toy data
df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})

np.unique(df['Genre'].sum())
# 109 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

set(df['Genre'].sum())
# 87 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

set([x  for y in df['Genre'] for x in y])
# 11.8 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)



Answer 4:


If you're just looking to extract the information and not add it back to the DataFrame, you can use Python's built-in set in a for loop:

import pandas as pd
df = pd.DataFrame({'movie':[[1,2,3],[1,2,6]]})
out = set()
for row in df['movie']:
    out.update({item for item in row})
print(out)

You could also wrap this in an apply call if you wanted (the apply itself would return a Series of None values, but the set is updated in place):

out = set()
df['movie'].apply(lambda x: out.update({item for item in x}))

Personally I think the for loop is a bit clearer to read.




Answer 5:


Not sure if it's exactly what you wanted, but this will allow you to convert it into a set.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Movie':['The Godfather', 'Dark Knight'], 'Genre': [['Crime', 'Drama'],['Crime', 'Drama', 'Action']]})

genres = []
for sublist in df['Genre']:
    for item in sublist:
        genres.append(item)

genre_set = set(genres)

print(genre_set)

Output: {'Action', 'Drama', 'Crime'}




Answer 6:


Use the power of sets for chained uniqueness. I've used this technique with huge lists in big-data-like environments. The main advantage is that it cuts down the time needed to produce a final flat list.

  1. Convert the list-column into sets
  2. Reduce all sets into a final set, using union

Try:

from functools import reduce # for python 3

l = df.Genre.dropna().tolist()
sets = [ set(i) for i in l ]
final_set = reduce(lambda x, y: x.union(y), sets)
  • In big-data-like environments such as Spark, use map to convert each list into a set, then reduce as above.
  • Change union to intersection if you need the values common to all lists.
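
The union and intersection variants can also be written without reduce, since set.union and set.intersection accept any number of sets. A minimal sketch, assuming a Genre column that may contain missing rows (the names all_genres and common_genres are only illustrative):

```python
import pandas as pd

# Assumed sample data; the None row stands in for a missing value.
df = pd.DataFrame({'Genre': [['crime', 'drama'], ['action', 'crime', 'drama'], None]})

# Step 1: convert each list in the column into a set, skipping missing rows.
sets = [set(genres) for genres in df['Genre'].dropna()]

# Step 2: combine the sets. Starting from an empty set makes the union safe
# even when the column has no usable rows.
all_genres = set().union(*sets)

# Values present in every row; guarded because set.intersection() needs
# at least one set to work with.
common_genres = set.intersection(*sets) if sets else set()

print(sorted(all_genres))     # ['action', 'crime', 'drama']
print(sorted(common_genres))  # ['crime', 'drama']
```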


Source: https://stackoverflow.com/questions/58528989/pandas-get-unique-values-from-column-of-lists
