Pandas DataFrame iteration: binning data based on time in milliseconds

Question

I have refocused my questions and tried to be as specific as possible. The code I have used so far is included below.

(1) When I pull data from SQL, the time column arrives in a mixed format that contains a letter, which is hard to work with. To get around that I tried df.time = pd.to_timedelta(df.time, unit='ms'), which runs, but I don't know how to extract the hours and minutes from the result. Example: 2019.11.22D01:18:00.01000. I just need the 'time' column in the format '01:18:00.01000'. Could I use np.datetime64 to convert all my SQL time entries to the desired format and then truncate to the characters I need? I also tried data = np.datetime64('time'), but I get: Error parsing datetime string "time" at position 0.

(2) I am trying to group the data below by two keys: first by 'data2' and then by 'time'. I need this because my real data will not arrive in the order shown below but in random order. I get: 'DataFrameGroupBy' is not callable. Is that because I have repeating 'data2' values? What is causing this?

(3) After grouping the data by 'data2' and 'time', I need to bin it into predefined time intervals (i.e. [0-10ms), [10-20ms), etc.), so that rows 0, 1, 2 fall into the [0-10ms) bin, for example. I therefore need to define the bins first (I will have a fixed set of bins). Then, at the next change in 'data2' (say from 55 to 56), the start time is reset to 0 and rows are binned by the time elapsed since 0 until 'data2' changes again, and so on. How can I code this? Where I struggle most is resetting the timer to 0 and referencing 'time' for every row as long as the 'data2' value has not changed, then starting over and binning accordingly when 'data2' changes.

Below is the code I have used so far:

import pyodbc 
import pandas as pd
import numpy as np

conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=XXXXXXXXX;'
                      'Database=Dynamics;'
                      'Trusted_Connection=yes;')

cursor = conn.cursor()

SQL_Query = pd.read_sql_query('''select ID,time,data1,data2,data3,data4,data5 from Dynamics''', conn)
df = pd.DataFrame(SQL_Query, columns=['ID','time','data2','data3','data4','data5'])
df.time=pd.to_timedelta(df.time, unit='ms')
df[['data4']] = df[['data4']].apply(pd.to_numeric)
df['diff']=df['data4']-df['data5']
df['diff']=df['diff'].abs()
df=df.groupby(['data3','time'])
print(df)



                     time data_1  data_2 data_3  data_4  data_5
0 2019-11-22 01:18:00.010      a      55      A    1.20    1.24
1 2019-11-22 01:18:00.090      a      55      B    1.25    1.24
2 2019-11-22 01:18:00.100      a      55      C    1.26    1.24
3 2019-11-22 01:18:00.140      a      55      A    1.22    1.22
4 2019-11-22 01:18:00.160      a      55      B    1.23    1.22

Answer 1:

Pandas has built-in support for date ranges. The example below creates a one-minute range with one row per millisecond; the timestamps also serve as the index.

import pandas as pd
from datetime import timedelta
import numpy as np

date_rng = pd.date_range(start='2019-11-22T01:18:00.00100', end='2019-11-22T01:19:00.00000', freq='ms') #one minute, in milliseconds
n = len(date_rng) # n = 60000
values = np.random.random(n) # make n random numbers

df = pd.DataFrame({'values': values}, index=date_rng)
print ('dataframe: ')
print (df.head())

This is the head of df:

dataframe: 
                           values
2019-11-22 01:18:00.001  0.914796
2019-11-22 01:18:00.002  0.760555
2019-11-22 01:18:00.003  0.132992
2019-11-22 01:18:00.004  0.572391
2019-11-22 01:18:00.005  0.090188
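
The example above starts from a ready-made DatetimeIndex, whereas the question's timestamps arrive as strings such as 2019.11.22D01:18:00.01000. As a minimal sketch, assuming every value follows exactly that layout (the sample strings and the names raw, parsed and time_only are made up for illustration), the strings could be parsed with pd.to_datetime and an explicit format, then reduced to the time of day:

```python
import pandas as pd

# Hypothetical sample strings in the layout quoted in the question.
raw = pd.Series([
    "2019.11.22D01:18:00.01000",
    "2019.11.22D01:18:00.09000",
    "2019.11.22D01:18:00.10000",
])

# The literal "D" separates date and time; %f parses the fractional seconds.
parsed = pd.to_datetime(raw, format="%Y.%m.%dD%H:%M:%S.%f")

# Time of day only, as text trimmed from microseconds to milliseconds.
time_only = parsed.dt.strftime("%H:%M:%S.%f").str[:-3]

print(parsed)
print(time_only)
```

Keeping the parsed datetime column for calculations and using the string form only for display is probably the safer choice, since strings cannot be subtracted or resampled.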

Next, the pandas resample method groups rows into fixed time windows; in this example it sums the values in 10 ms bins.

df2 = df.resample(rule=timedelta(milliseconds=10)).sum() # df2 sums the values in 10 ms bins
print ('beginning of df2')
print (df2.head())
print ('...')
print (df2.tail())

Here is the output:

beginning of df2
                           values
2019-11-22 01:18:00.000  5.236037
2019-11-22 01:18:00.010  4.446964
2019-11-22 01:18:00.020  6.549635
2019-11-22 01:18:00.030  5.141522
2019-11-22 01:18:00.040  5.375919
...
                           values
2019-11-22 01:18:59.960  3.876523
2019-11-22 01:18:59.970  4.864252
2019-11-22 01:18:59.980  5.690987
2019-11-22 01:18:59.990  2.787247
2019-11-22 01:19:00.000  0.613545

Note that the last value is much smaller because its bin contains only a single 1 ms sample (01:19:00.000).
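
resample bins against the clock, while the question asks for bins measured from the moment 'data2' changes. Here is a rough sketch of one way to do that, under several assumptions that are not in the original post: the sample rows and the bin edges are invented, each distinct 'data2' value is treated as one group, and the timer starts at the earliest row of that group. It also shows that a groupby result needs an aggregation (here agg) before it yields a table; printing the DataFrameGroupBy object itself only shows the object.

```python
import pandas as pd

# Hypothetical rows modelled on the question's sample, deliberately unsorted.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-11-22 01:18:00.090", "2019-11-22 01:18:00.010",
        "2019-11-22 01:18:00.100", "2019-11-22 01:18:05.000",
        "2019-11-22 01:18:05.012",
    ]),
    "data2": [55, 55, 55, 56, 56],
    "data4": [1.25, 1.20, 1.26, 1.22, 1.23],
})

# Order rows by group, then by time within each group.
df = df.sort_values(["data2", "time"]).reset_index(drop=True)

# Assumption: the timer starts at the earliest row of each data2 group.
start = df.groupby("data2")["time"].transform("min")
df["elapsed_ms"] = (df["time"] - start) / pd.Timedelta(milliseconds=1)

# Assumed fixed bin edges in milliseconds; right=False gives [0, 10), [10, 20), ...
edges = [0, 10, 20, 50, 100, 200]
df["bin"] = pd.cut(df["elapsed_ms"], bins=edges, right=False)

# Aggregate per group and per bin; observed=True drops empty bins.
summary = df.groupby(["data2", "bin"], observed=True)["data4"].agg(["count", "mean"])
print(summary)
```

If the same 'data2' value can reappear later as a separate run, the grouping key would need to identify consecutive runs instead, for example (df["data2"] != df["data2"].shift()).cumsum() computed on the time-ordered data.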

Source: https://stackoverflow.com/questions/60319858/panda-df-iteration-binning-of-data-based-on-time-in-milliseconds

Tags

python

pandas

loops

dataframe

binning