Question
I have a piece of code that runs, but it does not scale well with bigger datasets AT ALL. We are talking about minutes on large datasets. Here is a toy dataset to illustrate the issue:
Id Supplier Avg_NetAmountSpent Date Quantity NetAmount
0 185781 SAXON 2953.500000 2020-05-10 401 9294
1 185781 SAXON 2953.500000 2020-05-09 3502 8890
2 185781 SAXON 2953.500000 2020-05-08 7380 8381
3 185781 SAXON 2953.500000 2020-05-08 3384 1734
4 185781 SAXON 2953.500000 2020-05-08 4826 4910
612 467809 SAXONIS 861.666667 2020-05-09 314 1854
613 467809 SAXONIS 861.666667 2020-05-08 3347 727
614 467809 SAXONIS 861.666667 2020-05-08 4875 6744
615 467809 SAXONIS 861.666667 2020-05-10 3000 2754
616 467809 SAXONIS 861.666667 2020-05-10 7807 8763
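For reproducibility, the toy dataset above can be reconstructed with something like this (just the rows shown, using the same index labels):

```python
import pandas as pd

# Rebuild the toy dataset shown above
data_initial = pd.DataFrame({
    'Id': [185781] * 5 + [467809] * 5,
    'Supplier': ['SAXON'] * 5 + ['SAXONIS'] * 5,
    'Avg_NetAmountSpent': [2953.5] * 5 + [861.666667] * 5,
    'Date': pd.to_datetime(['2020-05-10', '2020-05-09', '2020-05-08',
                            '2020-05-08', '2020-05-08', '2020-05-09',
                            '2020-05-08', '2020-05-08', '2020-05-10',
                            '2020-05-10']),
    'Quantity': [401, 3502, 7380, 3384, 4826, 314, 3347, 4875, 3000, 7807],
    'NetAmount': [9294, 8890, 8381, 1734, 4910, 1854, 727, 6744, 2754, 8763],
}, index=[0, 1, 2, 3, 4, 612, 613, 614, 615, 616])
```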
My code looks like this; the issue comes from the last line:
import pandas as pd

today1 = pd.to_datetime('today').normalize()
frequency1 = '1D'
# Number of bins between the earliest date in the data and today
Nbin1 = (today1 - data_initial['Date'].min()) // pd.Timedelta(frequency1) + 1
# Bin edges, one per period, counting back from today
bins1 = [today1 - n * pd.Timedelta(frequency1) for n in range(Nbin1, -1, -1)]
# The slow line: group by Id and by the date bin each row falls into
data11 = data_initial.groupby(['Id', pd.cut(data_initial['Date'], bins=bins1)]).sum().reset_index()
The output of the code is the following:
Id Date Quantity NetAmount
0 185781 (2020-05-07, 2020-05-08] 244780.0 372349.0
1 185781 (2020-05-08, 2020-05-09] 208825.0 364153.0
2 185781 (2020-05-09, 2020-05-10] 214401.0 314165.0
3 185781 (2020-05-10, 2020-05-11] NaN NaN
4 332146 (2020-05-07, 2020-05-08] 261302.0 450012.0
5 332146 (2020-05-08, 2020-05-09] 268464.0 498076.0
6 332146 (2020-05-09, 2020-05-10] 279866.0 432608.0
7 332146 (2020-05-10, 2020-05-11] NaN NaN
8 467809 (2020-05-07, 2020-05-08] 337624.0 568162.0
9 467809 (2020-05-08, 2020-05-09] 327044.0 496085.0
10 467809 (2020-05-09, 2020-05-10] 275298.0 383322.0
11 467809 (2020-05-10, 2020-05-11] NaN NaN
UPDATE: Some modifications made the code much faster to execute:
frequency1 = '1D'
# Days between each row's date and today
data_initial['Period from Today'] = abs((pd.to_datetime(data_initial['Date']) - today1).dt.days)
data1 = data_initial.groupby(['Id', 'Date', 'Period from Today']).agg({'Quantity': 'sum', 'NetAmount': 'sum'}).reset_index()
resulting in the following output:
Id Date Period from Today Quantity NetAmount
0 285556 2020-05-06 5 218716 369230
1 285556 2020-05-07 4 345800 441942
2 285556 2020-05-08 3 226062 339148
3 443756 2020-05-06 5 782724 1187067
4 443756 2020-05-07 4 839970 1234092
5 443756 2020-05-08 3 929337 1049235
but it needs to look like this:
Id Date Period from Today Quantity NetAmount
0 285556 2020-05-06 5 218716 369230
1 285556 2020-05-07 4 345800 441942
2 285556 2020-05-08 3 226062 339148
3 285556 2020-05-09 2 0 0
4 285556 2020-05-10 1 0 0
5 285556 2020-05-11 0 0 0
etc...
However, there is still an issue: how could I tweak what I have so that the dates go up to the current date, rather than stopping at the most recent date in the dataset?
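One possible approach (a sketch, not necessarily the fastest): build the full grid of `(Id, Date)` pairs up to today with `pd.MultiIndex.from_product`, then `reindex` the aggregated frame onto it, filling missing days with 0. The function name `fill_to_today` is made up for illustration:

```python
import pandas as pd

def fill_to_today(data1):
    """Sketch: extend each Id's daily aggregates up to today's date,
    filling days with no rows with 0. Assumes `data1` has columns
    Id, Date, Quantity, NetAmount, with Date as datetime64."""
    today = pd.to_datetime('today').normalize()
    # Every calendar day from the earliest date in the data up to today
    full_dates = pd.date_range(data1['Date'].min(), today, freq='D')
    # Cartesian product of all Ids with all dates
    full_index = pd.MultiIndex.from_product(
        [data1['Id'].unique(), full_dates], names=['Id', 'Date'])
    out = (data1.set_index(['Id', 'Date'])[['Quantity', 'NetAmount']]
                .reindex(full_index, fill_value=0)
                .reset_index())
    out['Period from Today'] = (today - out['Date']).dt.days
    return out
```

This avoids `pd.cut` entirely: the grid is built once, vectorized, and the reindex is an index-aligned lookup rather than a per-row binning.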
Thank you in advance for any help!
Source: https://stackoverflow.com/questions/61726593/problem-with-pandas-efficiency-when-working-with-dates