Detecting outliers in a Pandas dataframe using a rolling standard deviation

Posted by 你离开我真会死。 on 2020-03-03 08:48:32

Question


I have a DataFrame for a fast Fourier transformed signal.

There is one column for the frequency in Hz and another column for the corresponding amplitude.

I read in a post from a couple of years ago that you can use a simple boolean filter to exclude, or keep only, the values that lie more than a few standard deviations from the mean.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': np.random.normal(size=200)})  # example dataset of normally distributed data
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]  # keeps the non-outliers; drop the ~ to keep only the outliers

The problem is that my signal drops by several orders of magnitude (down to 10,000 times smaller) as the frequency increases up to 50,000 Hz. I therefore cannot use a filter that only exports values more than 3 standard deviations above the overall mean, because it would only pick up the "peak" outliers from the first 50 Hz.

Is there a way to export the outliers in my DataFrame that are more than 3 rolling standard deviations above a rolling mean instead?


Answer 1:


This is maybe best illustrated with a quick example. Basically, you compare your existing data to a new column that is the rolling mean plus three rolling standard deviations.

import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Data':np.random.normal(size=200)})

# Create a few outliers (3 of them, at index locations 10, 55, 80)
df.iloc[[10, 55, 80]] = 40.    

r = df.rolling(window=20)  # Create a rolling object (no computation yet)
mps = r.mean() + 3. * r.std()  # Combine a mean and stdev on that object

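print(df[df.Data > mps.Data])  # Boolean filter
#     Data
# 55  40.0
# 80  40.0
# Index 10 is not flagged: the first 19 rows have no full 20-row window,
# so their rolling mean and std are NaN and the comparison is False.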
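
To tie this back to the situation in the question, here is a minimal sketch on a made-up spectrum. The column names freq_hz and amplitude, the 1000 / f decay, the window of 50 rows, and the injected peak are all assumptions chosen for illustration, not the asker's real data. It contrasts a single global threshold with the rolling one:

```python
import numpy as np
import pandas as pd

# Synthetic spectrum (hypothetical): amplitude decays as 1000 / f,
# with one small peak injected far out in the high-frequency tail.
freq_hz = np.arange(1, 2001)
spectrum = pd.DataFrame({"freq_hz": freq_hz, "amplitude": 1000.0 / freq_hz})
spectrum.loc[1500, "amplitude"] *= 20  # row 1500 is 1501 Hz; peak is about 13.3

amp = spectrum["amplitude"]

# Global threshold: one mean and one std for the whole column,
# both dominated by the large low-frequency amplitudes.
global_hits = spectrum[amp > amp.mean() + 3 * amp.std()]

# Rolling threshold: each row is compared with its own trailing 50-row window.
r = amp.rolling(window=50)
rolling_hits = spectrum[amp > r.mean() + 3 * r.std()]

print(global_hits["freq_hz"].tolist())  # only the lowest frequencies
print(rolling_hits)                     # the injected peak at 1501 Hz
```

In this construction the global filter only returns the first handful of rows, where the baseline itself is large, while the rolling filter returns the injected peak. The window length is a tuning choice: it needs to be long enough that a single peak does not dominate its own window's standard deviation, and short enough that the baseline is roughly level within it.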

To add a new column that keeps only the outliers, with NaN elsewhere:

df['Peaks'] = df['Data'].where(df.Data > mps.Data, np.nan)

print(df.iloc[50:60])
        Data  Peaks
50  -1.29409    NaN
51  -1.03879    NaN
52   1.74371    NaN
53  -0.79806    NaN
54   0.02968    NaN
55  40.00000   40.0
56   0.89071    NaN
57   1.75489    NaN
58   1.49564    NaN
59   1.06939    NaN

Here .where returns

An object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
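
A tiny sketch of that behaviour on a hand-made Series (the values and the threshold of 3 are arbitrary, picked only for illustration). It also shows .mask, which is the complement of .where, and a non-NaN replacement value:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 5.0, 2.0, 8.0])
cond = s > 3

kept = s.where(cond)         # values where cond is True, NaN elsewhere
hidden = s.mask(cond)        # the complement: NaN where cond is True
filled = s.where(cond, 0.0)  # use 0.0 instead of NaN as the replacement

print(kept.tolist())    # [nan, 5.0, nan, 8.0]
print(hidden.tolist())  # [1.0, nan, 2.0, nan]
print(filled.tolist())  # [0.0, 5.0, 0.0, 8.0]
```

Because NaN is already the default replacement, passing np.nan explicitly as the second argument gives the same result as leaving it out.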



Source: https://stackoverflow.com/questions/46796265/detecting-outliers-in-a-pandas-dataframe-using-a-rolling-standard-deviation
