Is there a numpy builtin to reject outliers from a list

孤城傲影 2020-11-28 18:00

Is there a NumPy builtin to do something like the following? That is, take a list d and return a list filtered_d with any outlying elements removed.

10 Answers
  •  攒了一身酷
    2020-11-28 19:07

    An alternative is to make a robust estimate of the standard deviation (assuming Gaussian statistics). Looking up online calculators, I see that the 90th percentile lies 1.2815σ above the mean and the 95th percentile lies 1.645σ above it (http://vassarstats.net/tabs.html?#z).
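
    For reference, those two factors can also be reproduced without an online table. This is a minimal sketch using Python's standard-library statistics.NormalDist; it is not part of the original answer, and the variable names are only illustrative:

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1)
std_normal = NormalDist()

# z-score at which the cumulative probability reaches 90% and 95%
z90 = std_normal.inv_cdf(0.90)
z95 = std_normal.inv_cdf(0.95)

print(f"90th percentile: {z90:.4f} sigma")
print(f"95th percentile: {z95:.4f} sigma")
```

    The printed values should agree with the table figures quoted above to about three decimal places.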

    As a simple example:

    import numpy as np
    
    # Create some random numbers
    x = np.random.normal(5, 2, 1000)
    
    # Calculate the statistics
    print("Mean= ", np.mean(x))
    print("Median= ", np.median(x))
    print("Max/Min=", x.max(), " ", x.min())
    print("StdDev=", np.std(x))
    print("90th Percentile", np.percentile(x, 90))
    
    # Add a few large points
    x[10] += 1000
    x[20] += 2000
    x[30] += 1500
    
    # Recalculate the statistics
    print()
    print("Mean= ", np.mean(x))
    print("Median= ", np.median(x))
    print("Max/Min=", x.max(), " ", x.min())
    print("StdDev=", np.std(x))
    print("90th Percentile", np.percentile(x, 90))
    
    # Estimate the standard deviation from percentile intervals, which a few
    # extreme points barely move. For a Gaussian, the 90th percentile lies
    # 1.2815 sigma above the median.
    p90 = np.percentile(x, 90)
    p10 = np.percentile(x, 10)
    p50 = np.median(x)
    
    # Estimate 1: median to 90th percentile spans 1.2815 sigma
    rSig = (p90-p50)/1.2815
    print("Robust Sigma=", rSig)
    
    # Estimate 2: 10th to 90th percentile spans 2*1.2815 sigma
    # (this value of rSig is the one used for filtering later)
    rSig = (p90-p10)/(2*1.2815)
    print("Robust Sigma=", rSig)
    

    The output I get is:

    Mean=  4.99760520022
    Median=  4.95395274981
    Max/Min= 11.1226494654   -2.15388472011
    StdDev= 1.976629928
    90th Percentile 7.52065379649
    
    Mean=  9.64760520022
    Median=  4.95667658782
    Max/Min= 2205.43861943   -2.15388472011
    StdDev= 88.6263902244
    90th Percentile 7.60646688694
    
    Robust Sigma= 2.06772555531
    Robust Sigma= 1.99878292462
    

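    Both robust estimates are close to the expected value of 2, the standard deviation used to generate the data, whereas the ordinary standard deviation is inflated by the three added points.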
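
    The percentile-based estimate can be folded into one small helper. This is only a sketch: robust_sigma and its lower/upper parameters are hypothetical names, not a NumPy API, and the seed is fixed just so the run is repeatable. It assumes the data are Gaussian apart from the outliers:

```python
import numpy as np
from statistics import NormalDist

def robust_sigma(x, lower=10, upper=90):
    """Estimate a Gaussian standard deviation from two percentiles.

    Hypothetical helper for illustration; not part of NumPy.
    """
    p_lo, p_hi = np.percentile(x, [lower, upper])
    # Width of the same percentile interval for a unit-sigma Gaussian
    z_lo = NormalDist().inv_cdf(lower / 100)
    z_hi = NormalDist().inv_cdf(upper / 100)
    return (p_hi - p_lo) / (z_hi - z_lo)

rng = np.random.default_rng(0)          # seeded only for repeatability
x = rng.normal(5, 2, 1000)
x[[10, 20, 30]] += [1000, 2000, 1500]   # contaminate three points

print("np.std       =", np.std(x))
print("robust_sigma =", robust_sigma(x))
```

    Deriving the scale factor from inv_cdf rather than hard-coding 1.2815 lets the same function work for other percentile pairs, such as 25 and 75.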

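    Suppose we want to remove points more than 5 robust standard deviations from the median. A 3σ cut would be too tight here: with 1000 Gaussian points we would expect about 3 values to lie more than 3 standard deviations from the mean.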

    y = x[abs(x - p50) < rSig*5]
    
    # Print the statistics again
    print("Mean= ", np.mean(y))
    print("Median= ", np.median(y))
    print("Max/Min=", y.max(), " ", y.min())
    print("StdDev=", np.std(y))
    

    Which gives:

    Mean=  4.99755359935
    Median=  4.95213030447
    Max/Min= 11.1226494654   -2.15388472011
    StdDev= 1.97692712883
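
    Putting the pieces together in the shape the question asks for (take d, return filtered_d) might look like the sketch below. reject_outliers and n_sigma are hypothetical names, not a NumPy builtin, and the approach assumes the bulk of the data is roughly Gaussian:

```python
import numpy as np

def reject_outliers(d, n_sigma=5.0):
    """Drop points farther than n_sigma robust standard deviations from the median.

    Hypothetical helper, not a NumPy builtin; assumes roughly Gaussian data.
    """
    d = np.asarray(d, dtype=float)
    p10, p50, p90 = np.percentile(d, [10, 50, 90])
    # For a Gaussian, the 10th to 90th percentile spans 2 * 1.2815 sigma
    r_sig = (p90 - p10) / (2 * 1.2815)
    return d[np.abs(d - p50) < n_sigma * r_sig]

rng = np.random.default_rng(1)          # seeded only for repeatability
d = rng.normal(5, 2, 1000)
d[[10, 20, 30]] += [1000, 2000, 1500]   # three injected outliers

filtered_d = reject_outliers(d)
print(len(d), "->", len(filtered_d))
print("Max/Min=", filtered_d.max(), " ", filtered_d.min())
```

    The threshold n_sigma is a judgment call: smaller values remove more of the legitimate tail along with the outliers.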
    

    I have no idea which approach is more efficient or more robust.
