Is there a numpy builtin to reject outliers from a list

孤城傲影 2020-11-28 18:00

Is there a numpy builtin to do something like the following? That is, take a list d and return a list filtered_d with any outlying elements removed.

10 answers
  • 2020-11-28 18:42

    Building on Benjamin's answer, using pandas.Series and replacing MAD with IQR:

    def reject_outliers(sr, iq_range=0.5):
        pcnt = (1 - iq_range) / 2
        qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1-pcnt])
        iqr = qhigh - qlow
        return sr[ (sr - median).abs() <= iqr]
    

    For instance, with iq_range=0.6 the quantile range widens from 0.25 <--> 0.75 to 0.20 <--> 0.80, so fewer points are rejected (more of the outlying values are kept).
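
A minimal sketch of how iq_range changes the result, assuming pandas is installed. The function body repeats the one above; the sample data and the names narrow and wide are made up for illustration.

```python
import pandas as pd

def reject_outliers(sr, iq_range=0.5):
    # Same logic as the answer's function: keep points that lie within
    # one inter-quantile range of the median.
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1 - pcnt])
    iqr = qhigh - qlow
    return sr[(sr - median).abs() <= iqr]

# Hypothetical sample data with one obvious outlier (95).
sr = pd.Series([10, 12, 11, 13, 12, 11, 95])

narrow = reject_outliers(sr)               # quantiles 0.25 and 0.75
wide = reject_outliers(sr, iq_range=0.8)   # quantiles 0.10 and 0.90

print(narrow.tolist())
print(wide.tolist())
```

Both calls drop 95. On this data the default setting also drops 10, which shows how tight the default band is; the wider range keeps it.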

  • 2020-11-28 18:42

    I wanted to do something similar, except setting the number to NaN rather than removing it from the data. Removing it changes the length, which can mess up plotting (e.g. if you're only removing outliers from one column in a table, but that column needs to stay the same length as the other columns so you can plot them against each other).

    To do so I used numpy's masking functions, which return a masked array of the same length with the outliers masked:

    import numpy as np

    def reject_outliers(data, m=2):
        # Mask (rather than drop) values more than m standard deviations from the mean
        stdev = np.std(data)
        mean = np.mean(data)
        maskMin = mean - stdev * m
        maskMax = mean + stdev * m
        mask = np.ma.masked_outside(data, maskMin, maskMax)
        print('Masking values outside of {} and {}'.format(maskMin, maskMax))
        return mask
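
If a plain array with literal NaN values is needed instead of a masked array, the mask can be filled afterwards. This is only a sketch: outliers_to_nan is a hypothetical helper name, the sample column is made up, and the input is assumed to be convertible to floats (NaN cannot be stored in an integer array).

```python
import numpy as np

def outliers_to_nan(data, m=2):
    # Hypothetical helper: same thresholds as the masking approach above,
    # but returns a plain float array with NaN in place of the outliers.
    data = np.asarray(data, dtype=float)   # NaN requires a float dtype
    mean, stdev = np.mean(data), np.std(data)
    masked = np.ma.masked_outside(data, mean - m * stdev, mean + m * stdev)
    return masked.filled(np.nan)

# Made-up column with one obvious outlier at the end.
column = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0, 50.0]
cleaned = outliers_to_nan(column)
print(cleaned)
```

The result has the same length as the input, so it can still be plotted against the other columns.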
    
  • 2020-11-28 18:42

    If you also want the index positions of the outliers, idx_list returns them.

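    import numpy as np

    def reject_outliers(data, m = 2.):
        d = np.abs(data - np.median(data))      # distance of each point from the median
        mdev = np.median(d)                     # median of those distances
        # Scaled distances; all zeros if mdev is 0, so nothing is rejected
        s = d/mdev if mdev else np.zeros(len(d))
        idx_list = np.arange(len(data))[s>=m]   # index positions of the outliers
        return data[s<m], idx_list
    
    data_points = np.array([8, 10, 35, 17, 73, 77])
    filtered, idx_list = reject_outliers(data_points)
    print('after rejection: {}, index positions of outliers: {}'.format(filtered, idx_list))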
    
    after rejection: [ 8 10 35 17], index positions of outliers: [4 5]
    
  • 2020-11-28 18:49

    I would like to provide two methods in this answer: one based on the z-score and one based on the IQR.

    The code in this answer works on both one-dimensional and multi-dimensional numpy arrays.

    Let's import some modules first.

    import collections
    import numpy as np
    import scipy.stats as stat
    from scipy.stats import iqr
    

    z-score based method

    This method tests whether a value lies more than three standard deviations from the mean (bar = 3 by default). It returns True where the value is an outlier and False otherwise.

    def sd_outlier(x, axis = None, bar = 3, side = 'both'):
        assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'
    
        d_z = stat.zscore(x, axis = axis)
    
        if side == 'gt':
            return d_z > bar
        elif side == 'lt':
            return d_z < -bar
        elif side == 'both':
            return np.abs(d_z) > bar
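
The same column-wise rule can be written with numpy alone, which may help to see what the returned boolean mask looks like. This is a sketch on made-up random data with one planted outlier; it uses the population standard deviation (ddof = 0), which is also the default of scipy's zscore.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))   # hypothetical data: 200 rows, 2 columns
x[5, 1] = 25.0                  # plant one obvious outlier

# Column-wise z-scores; keepdims lets the statistics broadcast against x.
z = (x - x.mean(axis=0, keepdims=True)) / x.std(axis=0, keepdims=True)
is_outlier = np.abs(z) > 3      # same rule as side = 'both' with bar = 3

print(is_outlier.shape)         # same shape as x
print(is_outlier[5, 1])         # the planted value is flagged
```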
    

    IQR based method

    This method will test if the value is less than q1 - 1.5 * iqr or greater than q3 + 1.5 * iqr, which is similar to SPSS's plot method.

    def q1(x, axis = None, keepdims = False):
        return np.percentile(x, 25, axis = axis, keepdims = keepdims)
    
    def q3(x, axis = None, keepdims = False):
        return np.percentile(x, 75, axis = axis, keepdims = keepdims)
    
    def iqr_outlier(x, axis = None, bar = 1.5, side = 'both'):
        assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'
    
        # keepdims = True leaves the reduced axes with length 1, so the
        # statistics broadcast against x for any axis, including axis = None
        d_iqr = iqr(x, axis = axis, keepdims = True)
        d_q1 = q1(x, axis = axis, keepdims = True)
        d_q3 = q3(x, axis = axis, keepdims = True)
        iqr_distance = np.multiply(d_iqr, bar)
    
        # Values beyond q3 + bar * iqr or below q1 - bar * iqr are outliers
        upper_outlier = np.greater(x, d_q3 + iqr_distance)
        lower_outlier = np.less(x, d_q1 - iqr_distance)
    
        if side == 'gt':
            return upper_outlier
        if side == 'lt':
            return lower_outlier
        return np.logical_or(upper_outlier, lower_outlier)
    

    Finally, if you want to filter out the outliers, use a numpy selector.
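
A "numpy selector" here presumably means boolean-mask indexing. The sketch below builds an IQR mask directly with np.percentile on made-up 1-D data, so it runs without scipy; a mask returned by either function above can be used the same way.

```python
import numpy as np

# Made-up 1-D data with one obvious outlier (40.0).
x = np.array([2.0, 3.0, 4.0, 3.0, 2.0, 4.0, 3.0, 40.0])

q1_value, q3_value = np.percentile(x, [25, 75])
fence = 1.5 * (q3_value - q1_value)
is_outlier = (x < q1_value - fence) | (x > q3_value + fence)

filtered = x[~is_outlier]                   # drop the outliers
blanked = np.where(is_outlier, np.nan, x)   # or keep the shape and blank them

print(filtered)
print(blanked)
```

Note that boolean indexing on a multi-dimensional array returns a flattened 1-D result, so the np.where form is the one that preserves the original shape.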

    Have a nice day.

  • 2020-11-28 18:56

    Benjamin Bannier's answer yields a pass-through when the median of distances from the median is 0, so I found this modified version a bit more helpful for cases as given in the example below.

    def reject_outliers_2(data, m=2.):
        d = np.abs(data - np.median(data))
        mdev = np.median(d)
        s = d / (mdev if mdev else 1.)
        return data[s < m]
    

    Example:

    data_points = np.array([10, 10, 10, 17, 10, 10])
    print(reject_outliers(data_points))    # original version from Benjamin Bannier's answer
    print(reject_outliers_2(data_points))
    

    Gives:

    [[10 10 10 17 10 10]]  # 17 is not filtered
    [10 10 10 10 10]  # 17 is filtered (its distance, 7, is greater than m)
    
  • 2020-11-28 19:00

    Something important when dealing with outliers is that one should try to use estimators that are as robust as possible. The mean of a distribution will be biased by outliers, whereas the median, for example, is much less affected.

    Building on eumiro's answer:

    def reject_outliers(data, m = 2.):
        d = np.abs(data - np.median(data))
        mdev = np.median(d)
        s = d/mdev if mdev else 0.
        return data[s<m]
    

    Here I have replaced the mean with the more robust median, and the standard deviation with the median absolute distance to the median. I then scaled the distances by their (again) median value so that m is on a reasonable relative scale.

    Note that for the data[s<m] syntax to work, data must be a numpy array.
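
Since the question starts from a plain list, one option is to convert on the way in and out. This is a sketch, not part of the original answer: reject_outliers_list is a hypothetical wrapper name, the data is made up, and the zero-median case uses an array of zeros instead of the scalar 0. so that the boolean index keeps its shape.

```python
import numpy as np

def reject_outliers_list(values, m=2.0):
    # Hypothetical wrapper: accepts a plain list and returns a plain list.
    data = np.asarray(values, dtype=float)
    d = np.abs(data - np.median(data))    # distance from the median
    mdev = np.median(d)                   # median of those distances
    # Assumption: when mdev is 0, treat every point as an inlier.
    s = d / mdev if mdev else np.zeros(len(d))
    return data[s < m].tolist()

# Made-up data with one obvious outlier (30.0).
d = [2.1, 3.5, 5.0, 4.2, 3.0, 4.8, 2.6, 30.0]
filtered_d = reject_outliers_list(d)
print(filtered_d)
```

With this data only 30.0 is removed; the remaining values come back in their original order.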
