Question
I have a piece of code that runs, but it does not scale well with bigger datasets AT ALL. We are talking about minutes on large datasets. Here is a toy dataset to illustrate the issue:
Id Supplier Avg_NetAmountSpent Date Quantity NetAmount
0 185781 SAXON 2953.500000 2020-05-10 401 9294
1 185781 SAXON 2953.500000 2020-05-09 3502 8890
2 185781 SAXON 2953.500000 2020-05-08 7380 8381
3 185781 SAXON 2953.500000 2020-05-08 3384 1734
4 185781 SAXON 2953.500000 2020-05-08 4826 4910
612 467809 SAXONIS 861.666667 2020-05-09 314 1854
613 467809 SAXONIS 861.666667 2020-05-08 3347 727
614 467809 SAXONIS 861.666667 2020-05-08 4875 6744
615 467809 SAXONIS 861.666667 2020-05-10 3000 2754
616 467809 SAXONIS 861.666667 2020-05-10 7807 8763
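For reproducibility, the toy dataset above can be reconstructed with something like this (just the rows shown, using the same index labels):

```python
import pandas as pd

# Rebuild the toy dataset shown above
data_initial = pd.DataFrame({
    'Id': [185781] * 5 + [467809] * 5,
    'Supplier': ['SAXON'] * 5 + ['SAXONIS'] * 5,
    'Avg_NetAmountSpent': [2953.5] * 5 + [861.666667] * 5,
    'Date': pd.to_datetime(['2020-05-10', '2020-05-09', '2020-05-08',
                            '2020-05-08', '2020-05-08', '2020-05-09',
                            '2020-05-08', '2020-05-08', '2020-05-10',
                            '2020-05-10']),
    'Quantity': [401, 3502, 7380, 3384, 4826, 314, 3347, 4875, 3000, 7807],
    'NetAmount': [9294, 8890, 8381, 1734, 4910, 1854, 727, 6744, 2754, 8763],
}, index=[0, 1, 2, 3, 4, 612, 613, 614, 615, 616])
```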
My code looks like this; the issue comes from the last line:
import pandas as pd

today1 = pd.to_datetime('today').normalize()
frequency1 = '1D'
# Number of bins between the earliest date in the data and today
Nbin1 = (today1 - data_initial['Date'].min()) // pd.Timedelta(frequency1) + 1
# Bin edges, one per period, counting back from today
bins1 = [today1 - n * pd.Timedelta(frequency1) for n in range(Nbin1, -1, -1)]
# The slow line: group by Id and by the date bin each row falls into
data11 = data_initial.groupby(['Id', pd.cut(data_initial['Date'], bins=bins1)]).sum().reset_index()
The output of the code is the following:
Id Date Quantity NetAmount
0 185781 (2020-05-07, 2020-05-08] 244780.0 372349.0
1 185781 (2020-05-08, 2020-05-09] 208825.0 364153.0
2 185781 (2020-05-09, 2020-05-10] 214401.0 314165.0
3 185781 (2020-05-10, 2020-05-11] NaN NaN
4 332146 (2020-05-07, 2020-05-08] 261302.0 450012.0
5 332146 (2020-05-08, 2020-05-09] 268464.0 498076.0
6 332146 (2020-05-09, 2020-05-10] 279866.0 432608.0
7 332146 (2020-05-10, 2020-05-11] NaN NaN
8 467809 (2020-05-07, 2020-05-08] 337624.0 568162.0
9 467809 (2020-05-08, 2020-05-09] 327044.0 496085.0
10 467809 (2020-05-09, 2020-05-10] 275298.0 383322.0
11 467809 (2020-05-10, 2020-05-11] NaN NaN
UPDATE: Some modifications made the code much faster to execute:
frequency1 = '1D'
# Days between each row's date and today
data_initial['Period from Today'] = abs((pd.to_datetime(data_initial['Date']) - today1).dt.days)
data1 = data_initial.groupby(['Id', 'Date', 'Period from Today']).agg({'Quantity': 'sum', 'NetAmount': 'sum'}).reset_index()
resulting in the following output:
Id Date Period from Today Quantity NetAmount
0 285556 2020-05-06 5 218716 369230
1 285556 2020-05-07 4 345800 441942
2 285556 2020-05-08 3 226062 339148
3 443756 2020-05-06 5 782724 1187067
4 443756 2020-05-07 4 839970 1234092
5 443756 2020-05-08 3 929337 1049235
but it needs to look like this:
Id Date Period from Today Quantity NetAmount
0 285556 2020-05-06 5 218716 369230
1 285556 2020-05-07 4 345800 441942
2 285556 2020-05-08 3 226062 339148
3 285556 2020-05-09 2 0 0
4 285556 2020-05-10 1 0 0
5 285556 2020-05-11 0 0 0
etc...
However, there is still an issue: how could I tweak what I have so that the dates go up to the current date, rather than stopping at the most recent date in the dataset?
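One possible approach (a sketch, not necessarily the fastest): build the full grid of `(Id, Date)` pairs up to today with `pd.MultiIndex.from_product`, then `reindex` the aggregated frame onto it, filling missing days with 0. The function name `fill_to_today` is made up for illustration:

```python
import pandas as pd

def fill_to_today(data1):
    """Sketch: extend each Id's daily aggregates up to today's date,
    filling days with no rows with 0. Assumes `data1` has columns
    Id, Date, Quantity, NetAmount, with Date as datetime64."""
    today = pd.to_datetime('today').normalize()
    # Every calendar day from the earliest date in the data up to today
    full_dates = pd.date_range(data1['Date'].min(), today, freq='D')
    # Cartesian product of all Ids with all dates
    full_index = pd.MultiIndex.from_product(
        [data1['Id'].unique(), full_dates], names=['Id', 'Date'])
    out = (data1.set_index(['Id', 'Date'])[['Quantity', 'NetAmount']]
                .reindex(full_index, fill_value=0)
                .reset_index())
    out['Period from Today'] = (today - out['Date']).dt.days
    return out
```

This avoids `pd.cut` entirely: the grid is built once, vectorized, and the reindex is an index-aligned lookup rather than a per-row binning.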
Thank you in advance for any help!
Source: https://stackoverflow.com/questions/61726593/problem-with-pandas-efficiency-when-working-with-dates