I would like to calculate the mean and standard deviation of a timedelta by bank from a dataframe with two columns shown
No need to convert timedelta back and forth. Numpy and pandas can seamlessly do it for you with a faster run time. Using your dropped DataFrame:
import numpy as np
grouped = dropped.groupby('bank')['diff']
mean = grouped.apply(lambda x: np.mean(x))
std = grouped.apply(lambda x: np.std(x))
I would suggest passing the numeric_only=False argument to mean as mentioned by Alexander Usikov - this works for pandas version 0.20+.
If you have an older version, the following works:
import pandas pd
df = pd.DataFrame({
'td': pd.Series([pd.Timedelta(days=i) for i in range(5)]),
'group': ['a', 'a', 'a', 'b', 'b']
})
(
df
.astype({'td': int}) # convert timedelta to integer (nanoseconds)
.groupby('group')
.mean()
.astype({'td': 'timedelta64[ns]'})
)
Pandas mean() and other aggregation methods support numeric_only=False parameter.
dropped.groupby('bank').mean(numeric_only=False)
Found here: Aggregations for Timedelta values in the Python DataFrame
You need to convert timedelta to some numeric value, e.g. int64 by values what is most accurate, because convert to ns is what is the numeric representation of timedelta:
dropped['new'] = dropped['diff'].values.astype(np.int64)
means = dropped.groupby('bank').mean()
means['new'] = pd.to_timedelta(means['new'])
std = dropped.groupby('bank').std()
std['new'] = pd.to_timedelta(std['new'])
Another solution is to convert values to seconds by total_seconds, but that is less accurate:
dropped['new'] = dropped['diff'].dt.total_seconds()
means = dropped.groupby('bank').mean()