问题
I have written the following code which creates two dataframes nq
and cmnt
.nq
contains the UserId
and corresponding time of Badge Attainment date
.cmnt
contains OwnerUserId
and the time when the User made a comment CreationDate
.
I want to get a count of the comments made for all days before and after 1 week of badge attainment so that I can create a time series line plot out of it.
The following code perform the same but produces a KeyError. Please provide a code that performs this operations for all users.
nq
UserId | date
1 2009-10-17 17:38:32.590
2 2009-10-19 00:37:23.067
3 2009-10-20 08:37:14.143
4 2009-10-21 18:07:51.247
5 2009-10-22 21:25:24.483
cmnt
OwnerUserId | CreationDate
1 2009-10-16 17:38:32.590
1 2009-10-18 17:38:32.590
2 2009-10-18 00:37:23.067
2 2009-10-17 00:37:23.067
2 2009-10-20 00:37:23.067
3 2009-10-19 08:37:14.143
4 2009-10-20 18:07:51.247
5 2009-10-21 21:25:24.483
code
nq.date = pd.to_datetime(nq.date)
cmnt.CreationDate = pd.to_datetime(cmnt.CreationDate)
count= []
for j in range(len(nq)):
for i in range(-7,8):
check_date = nq.date.iloc[j] + timedelta(days=i)
count = cmnt.loc[(cmnt.OwnerUserId == nq.UserId.iloc[j]) & (cmnt.CreationDate == check_date)].shape[0]
nq.iloc[j].append({nq[i]:count})
expected output
UserId | date |-7|-6|-5|-4|-3|-2|-1|0 |1 |2 |3 |4 |5 |6 |7
1 2009-10-17 17:38:32.590 |0 |0 |0 |0 |0 |0 |1 |0 |1 |0 |0 |0 |0 |0 |0
2 2009-10-19 00:37:23.067 |0 |0 |0 |0 |0 |1 |1 |0 |1 |0 |0 |0 |0 |0 |0
3 2009-10-20 08:37:14.143 |0 |0 |0 |0 |0 |0 |1 |0 |0 |0 |0 |0 |0 |0 |0
4 2009-10-21 18:07:51.247 |0 |0 |0 |0 |0 |0 |1 |0 |0 |0 |0 |0 |0 |0 |0
5 2009-10-22 21:25:24.483 |0 |0 |0 |0 |0 |0 |1 |0 |0 |0 |0 |0 |0 |0 |0
Here column -1
means comment made 1 day before badge attainment and 1
means comment made one day after badge attainment and so on.
Note There can be a completely alternately way to do this. My main objective is to draw a time series line plot which shows the number of comments made by the users before and after attainment of the badge.
回答1:
Here's a way to do that:
t = pd.merge(nq, cmnt, left_on="UserId", right_on = "OwnerUserId")
t["days_diff"] = (t["CreationDate"] - t["date"]).dt.days
t["count"] = t.groupby(["UserId", "days_diff"]).OwnerUserId.transform("count")
all_days = pd.DataFrame(itertools.product(t.UserId.unique(), range(-7, 8)), )
all_days.columns = ["UserId", "day"]
t = pd.merge(t, all_days, left_on=["UserId", "days_diff"], right_on=["UserId", "day"], how = "right")
t = pd.pivot_table(t, index="UserId", columns="day", values="count", dropna=False)
res = pd.merge(nq, t, left_on="UserId", right_index=True)
print(res)
The output is:
UserId date -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7
0 1 2009-10-17 17:38:32.590 NaN NaN NaN NaN NaN NaN 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN
1 2 2009-10-19 00:37:23.067 NaN NaN NaN NaN NaN 1.0 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN
2 3 2009-10-20 08:37:14.143 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
3 4 2009-10-21 18:07:51.247 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 5 2009-10-22 21:25:24.483 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
来源:https://stackoverflow.com/questions/62958702/how-to-get-count-of-values-based-on-datetime-in-python