Python pandas: how to remove outliers from a DataFrame and replace them with an average value of preceding records

久未见 submitted on 2019-12-10 18:13:12

Question


I have a DataFrame of 16k records with multiple groups of countries and other fields. I have produced an initial output of the data that looks like the snippet below. Now I need to do some data cleansing: identify skewed values (outliers) and replace each one with a value based on certain rules.

For example, in the data below, how could I identify the skewed points (any value greater than 1) and replace each one with the average of the next two records in the same group, or with the previous record if there are no later records in that group?

So in the DataFrame below I would like to replace the Bill%4 value of 1.21 for IT week1 with the average of week2 and week3 for IT (0.80 and 0.82), which is 0.81.

Are there any tricks for this?

Country Week    Bill%1  Bill%2  Bill%3  Bill%4  Bill%5  Bill%6
IT     week1    0.94    0.88    0.85    1.21    0.77    0.75
IT     week2    0.93    0.88    1.25    0.80    0.77    0.72
IT     week3    0.94    1.33    0.85    0.82    0.76    0.76
IT     week4    1.39    0.89    0.86    0.80    0.80    0.76
FR     week1    0.92    0.86    0.82    1.18    0.75    0.73
FR     week2    0.91    0.86    1.22    0.78    0.75    0.71
FR     week3    0.92    1.29    0.83    0.80    0.75    0.75
FR     week4    1.35    0.87    0.84    0.78    0.78    0.74
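
As a starting point, the sample above can be loaded and the values greater than 1 flagged. This is only a minimal sketch, assuming the table is available as whitespace-delimited text; the names raw, bill_cols, long_form, and flagged are illustrative and not part of the original question.

```python
import io

import pandas as pd

# Sample table from the question, assumed to be whitespace-delimited text
raw = """Country Week    Bill%1  Bill%2  Bill%3  Bill%4  Bill%5  Bill%6
IT     week1    0.94    0.88    0.85    1.21    0.77    0.75
IT     week2    0.93    0.88    1.25    0.80    0.77    0.72
IT     week3    0.94    1.33    0.85    0.82    0.76    0.76
IT     week4    1.39    0.89    0.86    0.80    0.80    0.76
FR     week1    0.92    0.86    0.82    1.18    0.75    0.73
FR     week2    0.91    0.86    1.22    0.78    0.75    0.71
FR     week3    0.92    1.29    0.83    0.80    0.75    0.75
FR     week4    1.35    0.87    0.84    0.78    0.78    0.74
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Columns holding the Bill% values
bill_cols = [c for c in df.columns if c.startswith("Bill%")]

# Reshape to one row per (Country, Week, column) so flagged values are easy to list
long_form = df.melt(id_vars=["Country", "Week"], value_vars=bill_cols,
                    var_name="column", value_name="value")

# Keep only the skewed points (any value greater than 1)
flagged = long_form[long_form["value"] > 1.0]
print(flagged.sort_values(["Country", "Week"]))
```

The cutoff of 1.0 comes from the question's rule; a different threshold could be substituted in the comparison.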

Answer 1:


I don't know of any built-ins to do this, but you should be able to customize this to meet your needs, no?

import numpy as np
import pandas as pd

# Random demo data with string index labels
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.index = list('abcdeflght')

# Define cutoff value
cutoff = 0.90

for col in df.columns:
    # Identify the values above the cutoff (keeps their index labels)
    outliers = df[col][df[col] > cutoff]

    # Walk through the outliers and average according to position
    for idx in outliers.index:
        # Integer position of this index label
        loc = df.index.get_loc(idx)

        # If not one of the last two rows, average the next two values
        if loc < df.shape[0] - 2:
            df.loc[idx, col] = df[col].iloc[loc + 1:loc + 3].mean()
        # Otherwise average the two preceding values
        else:
            df.loc[idx, col] = df[col].iloc[loc - 2:loc].mean()
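
The loop above works on a random demo frame and treats each column as a single series. To apply the question's rule per country (any value greater than 1 becomes the average of the next two records in the same group, or the previous record when no later record exists), one possible sketch is shown here. It is not the original answer's method. Assumptions: rows are already in week order within each country, a value with only one later record takes that single record, and the helper name replace_outliers and the variables raw, bill_cols, and cleaned are hypothetical.

```python
import io

import pandas as pd

# Sample table from the question, assumed to be whitespace-delimited text
raw = """Country Week    Bill%1  Bill%2  Bill%3  Bill%4  Bill%5  Bill%6
IT     week1    0.94    0.88    0.85    1.21    0.77    0.75
IT     week2    0.93    0.88    1.25    0.80    0.77    0.72
IT     week3    0.94    1.33    0.85    0.82    0.76    0.76
IT     week4    1.39    0.89    0.86    0.80    0.80    0.76
FR     week1    0.92    0.86    0.82    1.18    0.75    0.73
FR     week2    0.91    0.86    1.22    0.78    0.75    0.71
FR     week3    0.92    1.29    0.83    0.80    0.75    0.75
FR     week4    1.35    0.87    0.84    0.78    0.78    0.74
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
bill_cols = [c for c in df.columns if c.startswith("Bill%")]


def replace_outliers(series, cutoff=1.0):
    """Hypothetical helper: clean one column of one country group."""
    values = series.to_numpy(dtype=float)  # original values, read only
    cleaned_values = values.copy()         # writable copy for the result
    for pos in range(len(values)):
        if values[pos] > cutoff:
            later = values[pos + 1:pos + 3]  # up to two later records
            if len(later) > 0:
                cleaned_values[pos] = later.mean()
            elif pos > 0:
                # No later record in the group: fall back to the previous one
                cleaned_values[pos] = values[pos - 1]
    return pd.Series(cleaned_values, index=series.index)


cleaned = df.copy()
for col in bill_cols:
    # transform runs the helper per country and keeps the original row order
    cleaned[col] = df.groupby("Country")[col].transform(replace_outliers)

print(cleaned)
```

Replacement values are read from the original data rather than from already-cleaned values, so a neighbour that is itself above the cutoff would still feed into the average; in the sample data this never happens, but on the full 16k records it may need an extra rule.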


Source: https://stackoverflow.com/questions/20887194/python-pandas-how-to-remove-outliers-from-a-dataframe-and-replace-with-an-averag