Is there a NumPy builtin to reject outliers from a list?

孤城傲影 2020-11-28 18:00

Is there a NumPy builtin to do something like the following? That is, take a list d and return a list filtered_d with any outlying elements removed.

10 Answers
  • 2020-11-28 19:01

    This method is almost identical to yours, just more idiomatic NumPy (it also works on NumPy arrays only):

    import numpy as np

    def reject_outliers(data, m=2):
        return data[np.abs(data - np.mean(data)) < m * np.std(data)]
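Since the question asks about a plain list while the function above works on arrays only, here is a minimal sketch of a list-friendly variant. The name `reject_outliers_list` and the sample values are made up for illustration; the filter itself is the same mean/std cut.

```python
import numpy as np

def reject_outliers_list(values, m=2.0):
    """Hypothetical list-friendly wrapper around the mean/std filter."""
    data = np.asarray(values, dtype=float)
    # Keep points that lie closer than m standard deviations to the mean.
    keep = np.abs(data - data.mean()) < m * data.std()
    return data[keep].tolist()

d = [2.0, 4.0, 5.0, 1.0, 6.0, 5.0, 40.0]
filtered_d = reject_outliers_list(d)
print(filtered_d)  # the value 40.0 is dropped, the rest are kept
```

Converting with `np.asarray` up front keeps the boolean-mask indexing, which a plain Python list does not support.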
    
  • 2020-11-28 19:01

    Consider that all the above methods fail when your standard deviation gets very large due to huge outliers.

    (Similarly, the average breaks down in the presence of huge outliers, and you should calculate the median instead. Though, the average is "more prone to such an error as the stdDv".)

    You could try to apply your algorithm iteratively, or you could filter using the interquartile range (here "factor" corresponds to an n*sigma range, but only when your data follows a Gaussian distribution):

    import numpy as np
    
    def sortoutOutliers(dataIn, factor):
        # Quartiles and interquartile range (IQR)
        quant3, quant1 = np.percentile(dataIn, [75, 25])
        iqr = quant3 - quant1
        # For Gaussian data the IQR spans about 1.34898 sigma
        iqrSigma = iqr / 1.34898
        medData = np.median(dataIn)
        # Keep points within factor * sigma of the median
        dataOut = [x for x in dataIn
                   if medData - factor * iqrSigma < x < medData + factor * iqrSigma]
        return dataOut
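The other option mentioned above, applying the mean/std cut iteratively, could look roughly like the sketch below. The function name `iterative_reject`, the default `m=3.0`, and the `max_iter` cap are assumptions for illustration, not part of the original answer.

```python
import numpy as np

def iterative_reject(data, m=3.0, max_iter=10):
    """Repeat the mean/std cut until no further points are removed."""
    data = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        # "<=" so that constant data (std == 0) is kept rather than emptied.
        keep = np.abs(data - data.mean()) <= m * data.std()
        if keep.all():
            break
        data = data[keep]
    return data

rng = np.random.default_rng(0)
x = rng.normal(5, 2, 1000)
x[:3] += [1000, 2000, 1500]   # three huge outliers inflate the first std
clipped = iterative_reject(x)
print(len(x), "->", len(clipped), "max:", clipped.max())
```

The first pass removes the huge points, which shrinks the standard deviation so that later passes can cut at a sensible width.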
    
  • 2020-11-28 19:01

    For a set of images (each image has 3 dimensions) where I wanted to reject outliers for each pixel, I used:

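    import numpy as np

    # imgs: the images stacked along axis 0
    mean = np.mean(imgs, axis=0)
    std = np.std(imgs, axis=0)
    # True where a value lies within 0.5 * std + 1 of the per-pixel mean
    mask = np.greater(0.5 * std + 1, np.abs(imgs - mean))
    masked = np.multiply(imgs, mask)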
    

    Then it is possible to compute the mean:

    masked_mean = np.divide(np.sum(masked, axis=0), np.sum(mask, axis=0))
    

    (I use it for Background Subtraction)

  • 2020-11-28 19:07

    An alternative is to make a robust estimate of the standard deviation (assuming Gaussian statistics). Looking up online calculators, I see that the 90th percentile corresponds to 1.2815σ and the 95th percentile to 1.645σ (http://vassarstats.net/tabs.html?#z).

    As a simple example:

    import numpy as np
    
    # Create some random numbers
    x = np.random.normal(5, 2, 1000)
    
    # Calculate the statistics
    print("Mean= ", np.mean(x))
    print("Median= ", np.median(x))
    print("Max/Min=", x.max(), " ", x.min())
    print("StdDev=", np.std(x))
    print("90th Percentile", np.percentile(x, 90))
    
    # Add a few large points
    x[10] += 1000
    x[20] += 2000
    x[30] += 1500
    
    # Recalculate the statistics
    print()
    print("Mean= ", np.mean(x))
    print("Median= ", np.median(x))
    print("Max/Min=", x.max(), " ", x.min())
    print("StdDev=", np.std(x))
    print("90th Percentile", np.percentile(x, 90))
    
    # Measure the percentile intervals, then estimate the standard deviation
    # of the distribution, both from the median to the 90th percentile and
    # from the 10th to the 90th percentile
    p90 = np.percentile(x, 90)
    p10 = np.percentile(x, 10)
    p50 = np.median(x)
    # p50 to p90 is 1.2815 sigma
    rSig = (p90-p50)/1.2815
    print("Robust Sigma=", rSig)
    
    rSig = (p90-p10)/(2*1.2815)
    print("Robust Sigma=", rSig)
    

    The output I get is:

    Mean=  4.99760520022
    Median=  4.95395274981
    Max/Min= 11.1226494654   -2.15388472011
    StdDev= 1.976629928
    90th Percentile 7.52065379649
    
    Mean=  9.64760520022
    Median=  4.95667658782
    Max/Min= 2205.43861943   -2.15388472011
    StdDev= 88.6263902244
    90th Percentile 7.60646688694
    
    Robust Sigma= 2.06772555531
    Robust Sigma= 1.99878292462
    

    Which is close to the expected value of 2.
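The 1.2815 and 1.645 factors quoted above do not have to come from an online table; they can be checked with the standard library's `statistics.NormalDist` (Python 3.8+). The helper name `robust_sigma` and the seeded test sample are assumptions for this sketch.

```python
from statistics import NormalDist
import numpy as np

std_normal = NormalDist()            # mean 0, sigma 1
z90 = std_normal.inv_cdf(0.90)       # about 1.2816
z95 = std_normal.inv_cdf(0.95)       # about 1.6449

def robust_sigma(values):
    """Hypothetical helper: sigma from the 10th to 90th percentile span."""
    p10, p90 = np.percentile(values, [10, 90])
    return (p90 - p10) / (2 * z90)

rng = np.random.default_rng(0)
sample = rng.normal(5, 2, 100_000)
sample[:3] += 1000                   # a few huge outliers barely move the estimate
print(z90, z95, robust_sigma(sample))
```

Other percentile pairs work the same way: divide the span by the matching difference of `inv_cdf` values.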

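    If we want to remove points more than 5 standard deviations above or below the median (with 1000 Gaussian points we would expect about 1 value more than 3 standard deviations above it, and about 3 values beyond ±3 standard deviations in total):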

    y = x[abs(x - p50) < rSig*5]
    
    # Print the statistics again
    print("Mean= ", np.mean(y))
    print("Median= ", np.median(y))
    print("Max/Min=", y.max(), " ", y.min())
    print("StdDev=", np.std(y))
    

    Which gives:

    Mean=  4.99755359935
    Median=  4.95213030447
    Max/Min= 11.1226494654   -2.15388472011
    StdDev= 1.97692712883
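To get a feel for how aggressive a given cut is, the expected number of Gaussian points beyond k standard deviations can be worked out directly. This is a small sketch assuming exactly Gaussian data and n = 1000 as in the example; the variable names are illustrative.

```python
from statistics import NormalDist

n = 1000
std_normal = NormalDist()

expected = {}
for k in (3, 5):
    # Two-sided tail probability P(|x - mu| > k * sigma) for a Gaussian.
    tail = 2 * (1 - std_normal.cdf(k))
    expected[k] = n * tail
    print(f"beyond {k} sigma: about {expected[k]:.2g} of {n} points")
```

So a 5-sigma cut should remove essentially no genuine Gaussian points from a sample of this size, while a 3-sigma cut is expected to remove a few.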
    

    I have no idea which approach is the more efficient/robust.
