Is there a function that can remove the outliers?

Submitted by 帅比萌擦擦* on 2020-01-30 06:19:06

Question


I found a function that detects outliers in a column, but I do not know how to remove them.

Is there a function for excluding or removing outliers from a column?

Here is the function that detects the outliers; I need help writing one that removes them:

import numpy as np
import pandas as pd

def detect_outlier(data_1):
    # Collect results in a local list so repeated calls do not accumulate
    outliers = []
    threshold = 3
    mean_1 = np.mean(data_1)
    std_1 = np.std(data_1)

    for y in data_1:
        z_score = (y - mean_1) / std_1
        if np.abs(z_score) > threshold:
            outliers.append(y)
    return outliers

Here is how I print the outliers:

#printing the outlier 
outlier_datapoints = detect_outlier(df['Pre_TOTAL_PURCHASE_ADJ'])
print(outlier_datapoints)

Answer 1:


I presume that by "remove the outliers" you mean "remove rows from the df dataframe which contain an outlier in the 'Pre_TOTAL_PURCHASE_ADJ' column." If this is incorrect, perhaps you could revise the question to make your meaning clear.

Sample data are also helpful, rather than forcing would-be answerers to formulate their own.

It's generally much more efficient to avoid iterating over the rows of a dataframe. For row selections so-called Boolean array indexing is a fast way of achieving your ends. Since you already have a predicate (function returning a truth value) that will identify the rows you want to exclude, you can use such a predicate to build another dataframe that contains only the outliers, or (by negating the predicate) only the non-outliers.

Since @political_scientist has already given a practical solution using scipy.stats.zscore to produce the predicate values in a new is_outlier column I will leave this answer as simple, general advice for working in numpy and pandas. Given that answer, the rows you want would be given by

df[~df['is_outlier']]

though it might be slightly more comprehensible to include the negation (~) in the generation of the selector column rather than in the indexing as above, renaming the column 'is_not_outlier'.
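As a minimal sketch of the Boolean-indexing approach described above, the following builds the selector with vectorized pandas operations instead of a loop. The column name and the threshold of 3 come from the question; the sample values and the helper name remove_outliers are made up for illustration.

```python
import pandas as pd

def remove_outliers(df, column, threshold=3.0):
    """Return a copy of df without rows whose |z-score| in `column` exceeds threshold."""
    values = df[column]
    # ddof=0 matches np.std, which the question's detect_outlier uses
    z_scores = (values - values.mean()) / values.std(ddof=0)
    is_outlier = z_scores.abs() > threshold
    # ~ negates the Boolean mask, keeping only the non-outlier rows
    return df[~is_outlier].copy()

# Hypothetical sample data: 19 ordinary values and one extreme value
df = pd.DataFrame({"Pre_TOTAL_PURCHASE_ADJ": [10.0] * 19 + [1000.0]})
cleaned = remove_outliers(df, "Pre_TOTAL_PURCHASE_ADJ")
print(len(df), len(cleaned))  # 20 19
```

Note that rows with a missing value in the column are kept by this sketch, because a comparison against NaN evaluates to False.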




Answer 2:


An easy solution would be to use scipy.stats.zscore

from scipy.stats import zscore
# calculates z-score values
df["zscore"] = zscore(df["Pre_TOTAL_PURCHASE_ADJ"]) 

# creates an `is_outlier` column of True/False values so that you can
# filter your dataframe accordingly; |z| >= 1.96 flags roughly the outer 5%
# of normally distributed data (the question uses a threshold of 3 instead)
df["is_outlier"] = df["zscore"].abs() >= 1.96



Answer 3:


Here are 2 methods for one-dimensional datasets.

Part 1: using upper and lower limits of 3 standard deviations

import numpy as np

# Function to detect outliers in one-dimensional datasets.
def find_anomalies(data):
    # Collect results in a local list so repeated calls do not accumulate
    anomalies = []
    # Set upper and lower limits to 3 standard deviations from the mean
    data_std = np.std(data)
    data_mean = np.mean(data)
    anomaly_cut_off = data_std * 3

    lower_limit = data_mean - anomaly_cut_off
    upper_limit = data_mean + anomaly_cut_off

    # Collect the values that fall outside the limits
    for outlier in data:
        if outlier > upper_limit or outlier < lower_limit:
            anomalies.append(outlier)
    return anomalies

Part 2: Using IQR (interquartile range)

# `data` is assumed to be a NumPy array or pandas Series
q1, q3 = np.percentile(data, [25, 75]) # get percentiles
iqr = q3 - q1 # the IQR value
lower_bound = q1 - (1.5 * iqr) # lower bound
upper_bound = q3 + (1.5 * iqr) # upper bound

np.sum(data > upper_bound) # how many datapoints are above the upper bound?
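The IQR bounds above only count outliers; to actually remove them, the same bounds can serve as a filter. This is a rough sketch, assuming the data is held in a pandas Series; the helper name remove_iqr_outliers, the parameter k, and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def remove_iqr_outliers(series, k=1.5):
    """Keep only the values inside [q1 - k*iqr, q3 + k*iqr]."""
    q1, q3 = np.percentile(series, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # between() is inclusive on both ends by default
    return series[series.between(lower_bound, upper_bound)]

# Hypothetical sample: nine ordinary values and one extreme value
data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
filtered = remove_iqr_outliers(data)
print(filtered.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

For this sample the quartiles are 3.25 and 7.75, so the bounds are -3.5 and 14.5, and only the value 100 is dropped.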



Answer 4:


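def outlier():
    import pandas as pd
    df1 = pd.read_csv("......\\train.csv")
    # return_type='both' yields the axes plus a dict of matplotlib line artists
    _, bp = df1.boxplot(return_type='both')
    # each flier artist holds the points drawn beyond the whiskers of one column
    outliers = [flier.get_ydata() for flier in bp["fliers"]]
    out_liers = [i.tolist() for i in outliers]
    return out_liers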


Source: https://stackoverflow.com/questions/57161413/is-there-function-that-can-remove-the-outliers
