Question
I have two dataframes: one lists the songs each user has listened to, and the other maps each user to a cluster.
DATA 1:
user song
A 11
A 22
B 99
B 11
C 11
D 44
C 66
E 66
D 33
E 55
F 11
F 77
DATA 2:
user cluster
A 1
B 2
C 3
D 1
E 2
F 3
Using the above datasets, I was able to work out which songs are listened to by the users of each cluster.
cluster songs
1 [11, 22, 33, 44]
2 [11, 99, 66, 55]
3 [11, 66, 88, 77]
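The question does not show how this table was built. A minimal sketch of one way to derive it, assuming the two tables are loaded as pandas DataFrames named `df1` and `df2` (names borrowed from the answers; `merged` and `cluster_songs` are hypothetical names used only here):

```python
import pandas as pd

# Sample data copied from DATA 1 and DATA 2 in the question.
df1 = pd.DataFrame({
    "user": list("AABBCDCEDEFF"),
    "song": [11, 22, 99, 11, 11, 44, 66, 66, 33, 55, 11, 77],
})
df2 = pd.DataFrame({"user": list("ABCDEF"), "cluster": [1, 2, 3, 1, 2, 3]})

# Attach each user's cluster to every (user, song) row.
merged = df1.merge(df2, on="user", how="left")

# Collect the distinct songs heard within each cluster as a sorted list.
cluster_songs = (
    merged.groupby("cluster")["song"]
    .agg(lambda s: sorted(s.unique()))
    .reset_index(name="songs")
)
print(cluster_songs)
```

With the sample data this yields [11, 66, 77] for cluster 3; song 88, listed for cluster 3 in the table above, does not occur anywhere in DATA 1.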
I need to assign to each user the songs from their cluster that they have not listened to yet. For example, A belongs to cluster 1 and has not yet listened to songs 33 and 44, so the output for A should look like the one below. Likewise, B belongs to cluster 2 and has not listened to songs 66 and 55.
EXPECTED OUTPUT:
user song
A [33, 44]
B [66, 55]
C [77]
D [11, 22]
E [11, 99]
F [66]
Answer 1:
Not easy:
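import pandas as pd

# add the cluster column and remove duplicate (user, song) pairs
df = pd.merge(df1, df2, on='user', how='left').drop_duplicates(['user','song'])

def f(x):
    # for each cluster, reshape to users (rows) x songs (columns)
    x = x.pivot(index='user', columns='song', values='cluster')
    # a NaN marks a song the user has not heard; collect those column labels per row
    return x.apply(lambda row: row.index[row.isnull()].tolist(), axis=1)

# stored under a new name so that df1 is not overwritten
result = df.groupby(['cluster']).apply(f).reset_index(level=0, drop=True).sort_index()
print(result)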
user
A [33, 44]
B [55, 66]
C [77]
D [11, 22]
E [11, 99]
F [66]
dtype: object
Similar solution:
import numpy as np

df = pd.merge(df1, df2, on='user', how='left').drop_duplicates(['user','song'])
# True where a user has not heard a song from their own cluster
mask = (df.groupby(['cluster'])
          .apply(lambda x: x.pivot(index='user', columns='song', values='cluster').isnull())
          .fillna(False)
          .astype(bool)
          .reset_index(level=0, drop=True)
          .sort_index())
# replace each True by the column (song) label
s = np.where(mask, ['{}, '.format(x) for x in mask.columns.astype(str)], '')
# join the labels per row and strip the trailing separator
s1 = pd.Series([''.join(x).strip(', ') for x in s], index=mask.index)
print (s1)
user
A 33, 44
B 55, 66
C 77
D 11, 22
E 11, 99
F 66
dtype: object
Answer 2:
Use sets for comparison.
Setup
df1
# user song
# 0 A 11
# 1 A 22
# 2 B 99
# 3 B 11
# 4 C 11
# 5 D 44
# 6 C 66
# 7 E 66
# 8 D 33
# 9 E 55
# 10 F 11
# 11 F 77
df2
# user cluster
# 0 A 1
# 1 B 2
# 2 C 3
# 3 D 1
# 4 E 2
# 5 F 3
df3
# cluster songs
# 0 1 [11, 22, 33, 44]
# 1 2 [11, 99, 66, 55]
# 2 3 [11, 66, 88, 77]
Calculation
df = df1.groupby('user')['song'].apply(set)\
        .reset_index().rename(columns={'song': 'heard'})
df['all'] = df['user'].map(df2.set_index('user')['cluster'])\
                      .map(df3.set_index('cluster')['songs'])\
                      .map(set)
df['not heard'] = df.apply(lambda row: row['all'] - row['heard'], axis=1)
Result
user heard all not heard
0 A {11, 22} {33, 11, 44, 22} {33, 44}
1 B {11, 99} {99, 66, 11, 55} {66, 55}
2 C {66, 11} {88, 66, 11, 77} {88, 77}
3 D {33, 44} {33, 11, 44, 22} {11, 22}
4 E {66, 55} {99, 66, 11, 55} {11, 99}
5 F {11, 77} {88, 66, 11, 77} {88, 66}
Extract whichever columns you need; converting a column of sets to lists is trivial, e.g. df[col] = df[col].map(list).
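Since sets are unordered, `map(list)` gives no guaranteed element order. A small sketch of the same conversion using `sorted` instead, on a toy two-row frame (the frame is made up for illustration and only mirrors the 'not heard' column above):

```python
import pandas as pd

# Toy frame with a column of sets, mirroring the 'not heard' column.
df = pd.DataFrame({"user": ["A", "B"], "not heard": [{33, 44}, {66, 55}]})

# sorted() turns each set into a list with a stable, ascending order.
df["not heard"] = df["not heard"].map(sorted)
print(df)
```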
Explanation
There are 3 steps:
- Convert lists to sets and aggregate heard songs by user to sets.
- Perform mappings to put all data in one table.
- Add a column which calculates the difference between 2 sets.
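The three steps above can be sketched as one self-contained script. This is an illustrative rewrite under the assumption that `df1`, `df2` and `df3` hold exactly the data shown under Setup; the intermediate names (`heard`, `cluster_of`, `songs_of`, `all_songs`, `not_heard`) are hypothetical and not part of the original answer.

```python
import pandas as pd

# Data as shown under Setup.
df1 = pd.DataFrame({
    "user": list("AABBCDCEDEFF"),
    "song": [11, 22, 99, 11, 11, 44, 66, 66, 33, 55, 11, 77],
})
df2 = pd.DataFrame({"user": list("ABCDEF"), "cluster": [1, 2, 3, 1, 2, 3]})
df3 = pd.DataFrame({
    "cluster": [1, 2, 3],
    "songs": [[11, 22, 33, 44], [11, 99, 66, 55], [11, 66, 88, 77]],
})

# Step 1: aggregate the songs each user has heard into a set.
heard = df1.groupby("user")["song"].apply(set)

# Step 2: look up each user's cluster, then that cluster's full song set.
cluster_of = df2.set_index("user")["cluster"]
songs_of = df3.set_index("cluster")["songs"].map(set)
all_songs = cluster_of.map(songs_of)

# Step 3: set difference per user, returned as a sorted list.
not_heard = {user: sorted(all_songs[user] - heard[user]) for user in heard.index}

for user, songs in not_heard.items():
    print(user, songs)
```

Because `df3` lists song 88 for cluster 3, it shows up for users C and F here, just as in the Result table.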
Source: https://stackoverflow.com/questions/48927671/assigning-the-value-to-a-user-depending-on-the-cluster-he-comes-from