What to do with missing values when plotting with seaborn?

别跟我提以往 2021-02-07 11:19

I replaced the missing values with NaN using the following lambda function:

data = data.applymap(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)
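
As a possible alternative to the element-wise lambda, `DataFrame.replace` accepts a regular expression, so whitespace-only cells can be swapped for NaN in one call. This is only a sketch: the frame `raw` and its contents are made up for illustration and are not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one cell holds only whitespace
raw = pd.DataFrame({"country": ["A", "B", "C"],
                    "alcconsumption": ["1.5", "   ", "3.2"]})

# Replace cells that consist solely of whitespace with NaN
cleaned = raw.replace(r"^\s+$", np.nan, regex=True)

# The column is still text at this point, so convert it before plotting
cleaned["alcconsumption"] = pd.to_numeric(cleaned["alcconsumption"])
print(cleaned)
```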

3 answers
  •  南旧 (original poster)
    2021-02-07 11:37

    I would definitely handle missing values before you plot your data. Whether or not to use dropna() depends entirely on the nature of your dataset. Is alcconsumption a single series or part of a dataframe? In the latter case, calling dropna() on the whole dataframe would remove the corresponding rows in the other columns as well. Are the missing values few or many? Are they spread around in your series, or do they tend to occur in groups? Is there perhaps reason to believe that there is a trend in your dataset?
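
To make the series-versus-dataframe point concrete, here is a small sketch on a made-up two-column frame (the name `frame` and its values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame; only column A has gaps
frame = pd.DataFrame({"A": [1.0, np.nan, 3.0, np.nan],
                      "B": [10, 20, 30, 40]})

# Dropping on the whole frame removes rows 1 and 3, including their valid B values
rows_dropped = frame.dropna()

# Dropping on the single column returns a shorter Series and leaves the frame untouched
a_only = frame["A"].dropna()

print(len(frame), len(rows_dropped), len(a_only))
```

If only one column is being plotted, dropping on that column alone should be enough, and the other columns keep all their rows.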

    If the missing values are few and scattered, you could easily use dropna(). In other cases I would choose to fill missing values with the previously observed value (1), or even fill them with interpolated values (2). But be careful! Replacing a lot of data with filled or interpolated observations could seriously distort your dataset and lead to very wrong conclusions.
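
Before choosing between these options, it may help to put numbers on "few or many" and "scattered or grouped". The following is one way to do that, assuming a made-up daily series `s` with a single block of missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical series: 15 daily values with one block of 4 missing
s = pd.Series(np.arange(15, dtype=float),
              index=pd.date_range("2017-01-01", periods=15, freq="D"))
s.iloc[8:12] = np.nan

missing = s.isna()
n_missing = int(missing.sum())          # how many values are missing
share_missing = float(missing.mean())   # fraction of the series that is missing

# Label each run of consecutive equal flags, then measure the longest NaN run
run_id = (missing != missing.shift()).cumsum()
longest_gap = int(missing.groupby(run_id).sum().max())

print(n_missing, round(share_missing, 3), longest_gap)
```

A small share with a longest gap of 1 suggests scattered values, where dropna() does little harm. A long gap, as here, is the case where filling or interpolating changes the distribution noticeably.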

    Here are some examples that use your snippet...

    seaborn.distplot(data['alcconsumption'],hist=True,bins=100)
    plt.xlabel('AlcoholConsumption')
    plt.ylabel('Frequency(normalized 0->1)')
    

    ... on a synthetic dataset:

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    def sample(rows, names):
        ''' Create a sample DataFrame of random integers with a daily DatetimeIndex
    
        Parameters
        ==========
        rows : number of rows in the dataframe
        names: list of column names
    
        Example
        =======
    
        >>> sample(rows = 2, names = ['A', 'B'])  # values vary from run to run
    
                     A   B
        2017-01-01  27  75
        2017-01-02 -50 -24
        '''
        rng = pd.date_range('1/1/2017', periods=rows, freq='D')
        # Random integers in [-100, 100), one column per name
        df_temp = pd.DataFrame(np.random.randint(-100, 100, size=(rows, len(names))), columns=names)
        df_temp = df_temp.set_index(rng)
    
        return df_temp
    
    df = sample(rows = 15, names = ['A', 'B'])
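    df['A'] = df['A'].astype(float)        # NaN needs a float column
    df.loc[df.index[8:12], 'A'] = np.nan   # .loc avoids chained assignment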
    df
    

    Output:

                   A   B
    2017-01-01 -63.0  10
    2017-01-02  49.0  79
    2017-01-03 -55.0  59
    2017-01-04  89.0  34
    2017-01-05 -13.0 -80
    2017-01-06  36.0  90
    2017-01-07 -41.0  86
    2017-01-08  10.0 -81
    2017-01-09   NaN -61
    2017-01-10   NaN -80
    2017-01-11   NaN -39
    2017-01-12   NaN  24
    2017-01-13 -73.0 -25
    2017-01-14 -40.0  86
    2017-01-15  97.0  60
    

    (1) Using forward fill with pandas.DataFrame.ffill()

    ffill will "fill values forward", meaning it replaces each NaN with the last valid value above it. (Older pandas versions spell this fillna(method='ffill'), which is deprecated since pandas 2.1.)

    filled = df['A'].ffill()  # keep df intact so it can be reused in (2)
    sns.distplot(filled, hist=True,bins=5)
    plt.xlabel('AlcoholConsumption')
    plt.ylabel('Frequency(normalized 0->1)')
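
To see what forward fill does to the gap itself, the sketch below rebuilds column A from the output shown earlier (the values are copied from that table; the variable names are illustrative) and prints the filled stretch:

```python
import numpy as np
import pandas as pd

# Column A from the output above, with its four-day gap
idx = pd.date_range("2017-01-01", periods=15, freq="D")
a = pd.Series([-63, 49, -55, 89, -13, 36, -41, 10,
               np.nan, np.nan, np.nan, np.nan, -73, -40, 97],
              index=idx, dtype=float)

filled = a.ffill()

# The last valid observation (10.0 on 2017-01-08) is repeated across the gap
print(filled.loc["2017-01-08":"2017-01-12"].tolist())
```

The four missing days all take the value 10.0, so the histogram gains a cluster of identical observations that was never actually measured.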
    

    (2) Using interpolation with pandas.DataFrame.interpolate()

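    interpolate() estimates each missing value from its valid neighbours according to the chosen method. With method='time' the index must be a DatetimeIndex, and the estimate is weighted by the actual time elapsed between observations rather than by row position.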

    df['A'] = df['A'].interpolate(method = 'time')
    sns.distplot(df['A'], hist=True,bins=5)
    plt.xlabel('AlcoholConsumption')
    plt.ylabel('Frequency(normalized 0->1)')
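
The difference between time-based and position-based interpolation only shows up when the timestamps are unevenly spaced. A minimal sketch on a made-up three-point series (not part of the sample data above):

```python
import numpy as np
import pandas as pd

# Hypothetical irregular series: a one-day step followed by a three-day step
idx = pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-05"])
s = pd.Series([0.0, np.nan, 4.0], index=idx)

by_position = s.interpolate(method="linear")  # treats the points as equally spaced
by_time = s.interpolate(method="time")        # weights by the actual time gap

print(by_position.iloc[1], by_time.iloc[1])
```

On the daily sample above the index is evenly spaced, so both methods should agree. The gap in A between 10.0 and -73.0 is filled with a straight line of roughly -6.6, -23.2, -39.8 and -56.4.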
    

    As you can see, the different methods render two very different results. I hope this will be useful to you. If not then let me know and I'll have a look at it again.
