I have a dataframe where each row contains various meta-data pertaining to a single Reddit comment (e.g. author, subreddit, comment text).
I want to do the following
Here are two strategies to do it. No doubt, there are other ways.
Assuming your dataframe looks something like this (obviously with more columns):
df = pd.DataFrame({'author':['a', 'a', 'b'], 'subreddit':['sr1', 'sr2', 'sr2']})
>>> df
author subreddit
0 a sr1
1 a sr2
2 b sr2
...
SOLUTION 1: groupby
More straightforward than solution 2, and similar to your first attempt:
group = df.groupby('author')
df2 = group.apply(lambda x: x['subreddit'].unique())
# Alternatively, same thing as a one liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())
Result:
>>> df2
author
a [sr1, sr2]
b [sr2]
The author is the index, and the single column is the list of all subreddits they are active in (this is how I interpreted how you wanted your output, according to your description).
If you wanted the subreddits each in a separate column, which might be more useable, depending on what you want to do with it, you could just do this after:
df2 = df2.apply(pd.Series)
Result:
>>> df2
0 1
author
a sr1 sr2
b sr2 NaN
Solution 2: Iterate through dataframe
you can make a new dataframe with all unique authors:
df2 = pd.DataFrame({'author':df.author.unique()})
And then just get the list of all unique subreddits they are active in, assigning it to a new column:
df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']]))
for _, x in df2.iterrows()]
This gives you this:
>>> df2
author subreddits
0 a [sr2, sr1]
1 b [sr2]