Question
I have the following dataframe:
topic  student  level  week
1      a        1      1
1      b        2      1
1      a        3      1
2      a        1      2
2      b        2      2
2      a        3      2
2      b        4      2
The new dataframe should represent the interactions between students within a topic. It should contain four columns: "student source", "student destination", "week" and "reply count".
Student Destination is a student with whom Student Source shared the topic.
Reply count is the number of times Student Destination "directly" replied to Student Source.
The new dataframe should look like:
st_source  st_dest  week  reply_count
a          b        1     1
a          b        2     2
b          a        1     1
b          a        2     1
Reply count is easier to explain with an example.
Suppose a thread is started by student A (who sends a message at level 1), B replies to A (sending a message at level 2), and C replies to B (sending a message at level 3). Then B "directly" replied to A, and C "directly" replied to B, but C's reply to A is not direct (and so we don't count it).
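The A, B, C example can be sketched in pandas as follows. This is only an illustration, assuming that a message at level n always replies to the message at level n - 1; the frame `thread` and the names `direct` and `direct_pairs` are made up for this sketch.

```python
import pandas as pd

# Hypothetical thread: A posts at level 1, B at level 2, C at level 3.
thread = pd.DataFrame({"student": ["A", "B", "C"], "level": [1, 2, 3]})

# Assumption: after sorting by level, the previous row's author is the
# student being replied to.
thread = thread.sort_values("level")
direct = pd.DataFrame({
    "st_source": thread["student"].shift(),  # author being replied to
    "st_dest": thread["student"],            # author of the reply
}).dropna()  # the level-1 message replies to nobody

# Only adjacent levels are paired, so (A, C) never appears.
direct_pairs = list(direct.itertuples(index=False, name=None))
print(direct_pairs)
```

Only B's reply to A and C's reply to B are kept; C's indirect reply to A is not counted.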
Does anyone have an idea how to do this?
Thank you in advance!
Answer 1:
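# Within each week, pair every post with the previous poster (shift),
# then count how often each (previous poster, poster) pair occurs.
# The first post of a week has no previous poster (NaN) and is dropped.
result = (df.groupby('week').apply(
    lambda g: g.groupby([g.student.shift(), g.student])
               .size()
               .rename_axis(['st_source', 'st_dest'])
               .to_frame('reply_count')
).reset_index())
result[['st_source', 'st_dest', 'week', 'reply_count']].sort_values(['st_source', 'st_dest'])
#   st_source st_dest  week  reply_count
# 0         a       b     1            1
# 2         a       b     2            2
# 1         b       a     1            1
# 3         b       a     2            1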
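For completeness, here is a self-contained sketch of the same idea that avoids `apply`. It is not the answer's exact method: it assumes that each topic is one thread and that a message at level n replies to the message at level n - 1, so it shifts within `topic` rather than within `week` (the two coincide on the sample data). The intermediate name `pairs` is chosen for this sketch only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "topic":   [1, 1, 1, 2, 2, 2, 2],
    "student": ["a", "b", "a", "a", "b", "a", "b"],
    "level":   [1, 2, 3, 1, 2, 3, 4],
    "week":    [1, 1, 1, 2, 2, 2, 2],
})

# Assumption: within a topic, level n replies to level n - 1.
df = df.sort_values(["topic", "level"])

# The previous poster in the same topic is the student being replied to.
pairs = df.assign(
    st_source=df.groupby("topic")["student"].shift(),
    st_dest=df["student"],
).dropna(subset=["st_source"])  # thread starters reply to nobody

# Count direct replies per (source, destination, week).
result = (
    pairs.groupby(["st_source", "st_dest", "week"])
         .size()
         .reset_index(name="reply_count")
)
print(result)
```

On the sample data this yields the four rows requested in the question. If a thread can span several weeks, each reply is counted in the week of the replying message.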
Source: https://stackoverflow.com/questions/43742173/python-dataframe-interaction