Is there a function that can remove the outliers?

生来不讨喜 2021-01-19 10:17

I found a function to detect outliers in columns, but I do not know how to remove the outliers.

Is there a function for excluding or removing outliers from the column?

4 Answers
  •  佛祖请我去吃肉
    2021-01-19 10:45

    I presume that by "remove the outliers" you mean "remove rows from the df dataframe which contain an outlier in the 'Pre_TOTAL_PURCHASE_ADJ' column." If this is incorrect, perhaps you could revise the question to make your meaning clear.

    Sample data are also helpful, rather than forcing would-be answerers to formulate their own.

    It's generally much more efficient to avoid iterating over the rows of a dataframe. For row selections, so-called Boolean array indexing is a fast way of achieving your ends. Since you already have a predicate (a function returning a truth value) that identifies the rows you want to exclude, you can use that predicate to build another dataframe that contains only the outliers or, by negating the predicate, only the non-outliers.
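
As a rough sketch of that idea: the question does not show the detection function, so the `is_outlier` predicate below is a hypothetical stand-in based on the common 1.5 * IQR rule, and the sample values are made up. Only the column name comes from the question.

```python
import pandas as pd

# Made-up sample data; only the column name comes from the question.
df = pd.DataFrame(
    {"Pre_TOTAL_PURCHASE_ADJ": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]}
)


def is_outlier(values: pd.Series) -> pd.Series:
    """Hypothetical predicate: True where a value falls more than
    1.5 * IQR below the first quartile or above the third quartile."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)


# The predicate is applied to the whole column at once, with no row loop.
mask = is_outlier(df["Pre_TOTAL_PURCHASE_ADJ"])

outliers = df[mask]        # rows where the predicate is True
non_outliers = df[~mask]   # rows where the predicate is False
```

The Boolean Series `mask` has one entry per row, so the same mask selects either group depending on whether it is negated.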

    Since @political_scientist has already given a practical solution, using scipy.stats.zscore to produce the predicate values in a new is_outlier column, I will leave this answer as simple, general advice for working in numpy and pandas. Given that answer, the rows you want would be given by:

    df[~df['is_outlier']]
    

    though it might be slightly more comprehensible to include the negation (~) in the generation of the selector column rather than in the indexing as above, renaming the column 'is_not_outlier'.
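
A minimal sketch of that variant, in which the selector column is generated already negated. The referenced answer is not reproduced here, so the cutoff of 3 standard deviations and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Made-up sample data: twenty ordinary values and one extreme value.
values = [10.0, 11.0, 12.0, 13.0] * 5 + [500.0]
df = pd.DataFrame({"Pre_TOTAL_PURCHASE_ADJ": values})

# Assumed cutoff; the threshold used in the referenced answer is not known.
THRESHOLD = 3.0

# Z-score of every value in the column, computed in one vectorized call.
z = zscore(df["Pre_TOTAL_PURCHASE_ADJ"].to_numpy())

# Store the negated predicate directly, so no ~ is needed when indexing.
df["is_not_outlier"] = np.abs(z) <= THRESHOLD

cleaned = df[df["is_not_outlier"]]
```

Note that a z-score cutoff only works with enough rows: in a sample of n values, no z-score computed this way can exceed sqrt(n - 1), so a threshold of 3 can never flag anything in a sample of ten or fewer values.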
