matplotlib: disregard outliers when plotting

广开言路 2020-12-04 13:13

I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller.

4 Answers
  • 2020-12-04 13:54

    If you aren't fussed about rejecting outliers as mentioned by Joe and your reasons are purely aesthetic, you could just set your plot's x-axis limits:

    plt.xlim(min_x_data_value,max_x_data_value)
    

    Where the values are your desired limits to display.

    plt.ylim(y_min, y_max) sets the limits on the y axis in the same way.

  • 2020-12-04 13:54

    I usually pass the data through np.clip. If you have a reasonable estimate of the minimum and maximum of your data, just use those. If you don't, the histogram of the clipped data will show you the size of the tails, and if the outliers really are just outliers, the tails should be small.

    What I run is something like this:

    import numpy as np
    import matplotlib.pyplot as plt
    
    data = np.random.normal(3, size=100000)
    plt.hist(np.clip(data, -15, 8), bins=333, density=True)
    

    You can compare the results if you change the min and max in the clipping function until you find the right values for your data.

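    With the data generated above (mean 3, standard deviation 1), bounds of -15 and 8 clip essentially nothing, since both tails end well inside them. If you tighten the max to, say, 4, you will see immediately that it is not a good choice: a tall spike appears in the last bin because you are removing a lot of meaningful information.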
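
    One way to judge a candidate pair of bounds without reading it off the histogram is to measure how much of the data each bound clips. This is only a minimal sketch: the helper name tail_fractions and the candidate bounds are illustrative assumptions, not part of the answer above.

```python
import numpy as np


def tail_fractions(data, lower, upper):
    """Return the fraction of samples below `lower` and above `upper`.

    Hypothetical helper for illustration only.
    """
    data = np.asarray(data, dtype=float)
    below = np.mean(data < lower)  # share of points that np.clip would raise to `lower`
    above = np.mean(data > upper)  # share of points that np.clip would lower to `upper`
    return float(below), float(above)


rng = np.random.default_rng(0)
data = rng.normal(3, size=100_000)

# Compare a few candidate bounds (chosen arbitrarily for this sketch)
for lower, upper in [(-15, 8), (0, 6), (2, 4)]:
    below, above = tail_fractions(data, lower, upper)
    print(f"clip to [{lower}, {upper}]: {below:.4%} below, {above:.4%} above")
```

    If a bound clips more than a tiny fraction of the points, it is probably cutting into real data rather than outliers.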

    Building on this, you could probably write some code that finds good bounds automatically, keeping the size of the clipped tails within some tolerance.
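
    A rough sketch of that idea, assuming "tolerance" means the fraction of points allowed in each tail. The function name find_clip_bounds, the default tolerance, and the made-up outlier values are all assumptions for illustration.

```python
import numpy as np


def find_clip_bounds(data, tol=0.001):
    """Return (lower, upper) so that roughly `tol` of the data lies in each tail.

    Hypothetical helper: it takes the `tol` and `1 - tol` quantiles as bounds.
    """
    if not 0 <= tol < 0.5:
        raise ValueError("tol must be in [0, 0.5)")
    data = np.asarray(data, dtype=float)
    lower, upper = np.quantile(data, [tol, 1.0 - tol])
    return float(lower), float(upper)


rng = np.random.default_rng(0)
# Mostly well-behaved data plus a few wild values (made up for this example)
data = np.concatenate([rng.normal(3, size=100_000), [250.0, -400.0, 1e4]])

lower, upper = find_clip_bounds(data, tol=0.001)
clipped = np.clip(data, lower, upper)
print(f"clipping to [{lower:.3f}, {upper:.3f}]")
```

    The clipped array can then be passed to plt.hist as in the example above. Note that this always clips about 2 * tol of the points, whether or not they are genuine outliers, so the tolerance should be kept small.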

  • 2020-12-04 14:07

    There's no single "best" test for an outlier. Ideally, you should incorporate a-priori information (e.g. "This parameter shouldn't be over x because of blah...").

    Most robust tests for outliers use the median absolute deviation rather than the standard deviation or some other variance-based measurement. Otherwise, the variance/stddev that is calculated will be heavily skewed by the outliers.

    Here's a function that implements one of the more common outlier tests.

    import numpy as np
    
    
    def is_outlier(points, thresh=3.5):
        """
        Returns a boolean array with True if points are outliers and False 
        otherwise.
    
        Parameters:
        -----------
            points : A numobservations by numdimensions array of observations
            thresh : The modified z-score to use as a threshold. Observations with
                a modified z-score (based on the median absolute deviation) greater
                than this value will be classified as outliers.
    
        Returns:
        --------
            mask : A numobservations-length boolean array.
    
        References:
        ----------
            Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
            Handle Outliers", The ASQC Basic References in Quality Control:
            Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
        """
        if len(points.shape) == 1:
            points = points[:,None]
        median = np.median(points, axis=0)
        diff = np.sum((points - median)**2, axis=-1)
        diff = np.sqrt(diff)
        med_abs_deviation = np.median(diff)
    
        modified_z_score = 0.6745 * diff / med_abs_deviation
    
        return modified_z_score > thresh
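
    To see what the function above computes in the one-dimensional case, here is the same calculation written out step by step on a tiny array. The sample values are made up for illustration. The constant 0.6745 is approximately the 0.75 quantile of the standard normal distribution, so for normally distributed data the modified z-score is on roughly the same scale as an ordinary z-score.

```python
import numpy as np

# Four ordinary values and one obvious outlier (made-up data)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(x)                  # 3.0
abs_dev = np.abs(x - median)           # [2, 1, 0, 1, 97]
mad = np.median(abs_dev)               # 1.0 (median absolute deviation)
modified_z = 0.6745 * abs_dev / mad    # [1.349, 0.6745, 0, 0.6745, 65.4265]

outliers = modified_z > 3.5            # only the last point exceeds the threshold
print(x[outliers])
```

    One caveat: if more than half of the values are identical, the median absolute deviation is 0 and the division is undefined. Neither this sketch nor the function above handles that case.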
    

    As an example of using it, you'd do something like the following:

    import numpy as np
    import matplotlib.pyplot as plt
    
    # The function above... In my case it's in a local utilities module
    from sci_utilities import is_outlier
    
    # Generate some data
    x = np.random.random(100)
    
    # Append a few "bad" points
    x = np.r_[x, -3, -10, 100]
    
    # Keep only the "good" points
    # "~" operates as a logical not operator on boolean numpy arrays
    filtered = x[~is_outlier(x)]
    
    # Plot the results
    fig, (ax1, ax2) = plt.subplots(nrows=2)
    
    ax1.hist(x)
    ax1.set_title('Original')
    
    ax2.hist(filtered)
    ax2.set_title('Without Outliers')
    
    plt.show()
    

    [Figure: two histograms, "Original" and "Without Outliers"]

  • 2020-12-04 14:09

    I think using pandas quantile is useful and much more flexible.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    
    fig = plt.figure()
    ax1 = fig.add_subplot(121)
    ax2 = fig.add_subplot(122)
    
    pd_series = pd.Series(np.random.normal(size=300)) 
    pd_series_adjusted = pd_series[pd_series.between(pd_series.quantile(.05), pd_series.quantile(.95))] 
    
    ax1.boxplot(pd_series)
    ax1.set_title('Original')
    
    ax2.boxplot(pd_series_adjusted)
    ax2.set_title('Adjusted')
    
    plt.show()
    
