Pandas, for each unique value in one column, get unique values in another column

情话喂你 2020-12-25 08:01

I have a dataframe where each row contains various meta-data pertaining to a single Reddit comment (e.g. author, subreddit, comment text).

I want to do the following

3 Answers

心在旅途 (original poster)

2020-12-25 09:00
Here are two strategies to do it. No doubt, there are other ways.

Assuming your dataframe looks something like this (obviously with more columns):
```
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'], 'subreddit': ['sr1', 'sr2', 'sr2']})

>>> df
  author subreddit
0      a       sr1
1      a       sr2
2      b       sr2
...
```
Solution 1: groupby

More straightforward than solution 2, and similar to your first attempt:
```
group = df.groupby('author')

df2 = group.apply(lambda x: x['subreddit'].unique())

# Alternatively, same thing as a one liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())
```
Result:
```
>>> df2
author
a    [sr1, sr2]
b         [sr2]
```
The result is a Series indexed by author; each value is an array of the unique subreddits that author is active in (this is how I interpreted the output you described).
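The same per-author result can also be reached without a lambda, by selecting the column on the grouped object before calling `unique`. A minimal sketch on the sample frame from above; the names `unique_subs` and `as_lists` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# Select the column first, then take the unique values within each group.
unique_subs = df.groupby('author')['subreddit'].unique()

# Optionally turn each array into a plain Python list.
as_lists = unique_subs.map(list)
print(as_lists)
```

Selecting the column up front keeps the operation on a single Series per group, which tends to be easier to read than a general `apply`.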

If you want each subreddit in its own column, which may be more usable depending on what you do next, you can follow up with:
```
df2 = df2.apply(pd.Series)
```
Result:
```
>>> df2
          0    1
author          
a       sr1  sr2
b       sr2  NaN
```
Solution 2: Iterate through dataframe

You can make a new dataframe with all unique authors:
```
df2 = pd.DataFrame({'author':df.author.unique()})
```
And then just get the list of all unique subreddits they are active in, assigning it to a new column:
```
df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']])) 
    for _, x in df2.iterrows()]
```
This gives you the following (the order inside each list may vary, because a `set` is unordered):
```
>>> df2
  author  subreddits
0      a  [sr2, sr1]
1      b       [sr2]
```