Replace NaN or missing values with rolling mean or other interpolation

爱一瞬间的悲伤 2020-12-06 01:34

I have a pandas DataFrame with monthly data that I want to compute a 12-month moving average for. Data for every January is missing (NaN), however, so I am usi

2 Answers
  • 2020-12-06 02:08

    There are several ways to approach this, and the best way will depend on whether the January data is systematically different from other months. Most real-world data is likely to be somewhat seasonal, so let's use the average high temperature (Fahrenheit) of a random city in the northern hemisphere as an example.

    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'month': [10, 11, 12, 1, 2, 3],
                       'temp':  [65, 50, 45, np.nan, 40, 43]}).set_index('month')
    

    You could use a rolling mean as you suggest, but the issue is that you will get an average temperature over the entire year, which ignores the fact that January is the coldest month. To correct for this, you could reduce the window to 3, which results in the January temp being the average of the December and February temps. (I am also using min_periods=1 as suggested in @user394430's answer.)

    df['rollmean12'] = df['temp'].rolling(12,center=True,min_periods=1).mean()
    df['rollmean3']  = df['temp'].rolling( 3,center=True,min_periods=1).mean()
    

    Both are improvements, but they still overwrite existing values with rolling means. To avoid this, you could combine the rolling mean with the update() method, which overwrites a value only where the series passed in is not NaN (see the pandas documentation for Series.update).

    df['update'] = df['rollmean3']
    df['update'].update( df['temp'] )  # note: this is an inplace operation
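
    If the goal is only to fill the gaps, the same result can probably be reached without an in-place call. The sketch below uses fillna() with the rolling mean as the fill source; the column name 'filled' is made up for illustration. This may also be the safer spelling under pandas' Copy-on-Write behavior, where an in-place method called on a column selected from the frame is not guaranteed to write back to the frame.

```python
import numpy as np
import pandas as pd

# Same sample data as above.
df = pd.DataFrame({'month': [10, 11, 12, 1, 2, 3],
                   'temp': [65, 50, 45, np.nan, 40, 43]}).set_index('month')

# Centered 3-month rolling mean, tolerating missing values in the window.
rollmean3 = df['temp'].rolling(3, center=True, min_periods=1).mean()

# Keep observed values; use the rolling mean only where temp is NaN.
# 'filled' is a hypothetical column name, not part of the original answer.
df['filled'] = df['temp'].fillna(rollmean3)
print(df)
```

    Only the January row changes; every observed month keeps its original value.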
    

    There are even simpler approaches that leave the existing values alone while filling the missing January temps with either the previous month, next month, or the mean of the previous and next month.

    df['ffill']   = df['temp'].ffill()         # previous month 
    df['bfill']   = df['temp'].bfill()         # next month
    df['interp']  = df['temp'].interpolate()   # mean of prev/next
    

    In this case, interpolate() defaults to simple linear interpolation, but several other interpolation options are available. See the pandas documentation on interpolate for more info, or this Stack Overflow question: Interpolation on DataFrame in pandas
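
    As one illustration of those other options, here is a minimal sketch comparing the default linear method with method='time', which needs a DatetimeIndex. The dates and values are invented to make the spacing uneven; they are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical, unevenly spaced readings (not the question's data).
s = pd.Series([10.0, np.nan, 50.0],
              index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-05']))

# Default 'linear' ignores the index and treats the points as equally spaced.
linear = s.interpolate()

# 'time' weights the fill by the elapsed time between the neighbouring dates.
by_time = s.interpolate(method='time')

print(linear.iloc[1], by_time.iloc[1])
```

    With evenly spaced monthly rows the two methods give nearly the same answer, so the choice matters mostly when the observations are irregular.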

    Here is the sample data with all the results:

           temp  rollmean12  rollmean3  update  ffill  bfill  interp
    month                                                           
    10     65.0        48.6  57.500000    65.0   65.0   65.0    65.0
    11     50.0        48.6  53.333333    50.0   50.0   50.0    50.0
    12     45.0        48.6  47.500000    45.0   45.0   45.0    45.0
    1       NaN        48.6  42.500000    42.5   45.0   40.0    42.5
    2      40.0        48.6  41.500000    40.0   40.0   40.0    40.0
    3      43.0        48.6  41.500000    43.0   43.0   43.0    43.0
    

    In particular, note that "update" and "interp" give the same results in all months. While it doesn't matter which one you use here, in other cases one way or the other might be better.
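
    One case where the two approaches diverge is a run of consecutive missing months. The sketch below uses made-up values with two adjacent gaps: the 3-month rolling mean can only see one observed neighbour per gap, while interpolate() draws a straight line across the whole gap.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two consecutive missing values.
temp = pd.Series([65.0, np.nan, np.nan, 40.0])

# Fill gaps from a centered 3-period rolling mean (observed values untouched).
roll_fill = temp.fillna(temp.rolling(3, center=True, min_periods=1).mean())

# Fill gaps by linear interpolation between the surrounding observations.
interp_fill = temp.interpolate()

print(pd.DataFrame({'temp': temp, 'roll_fill': roll_fill, 'interp': interp_fill}))
```

    Here the rolling-mean fill simply repeats the nearest observed value on each side, whereas interpolation steps evenly from 65 down to 40.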

  • 2020-12-06 02:27

    The real key is having min_periods=1. Also, as of pandas 0.18, the proper way to call this is through a Rolling object. Therefore, your code should be

    data["variable"].rolling(min_periods=1, center=True, window=12).mean()
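
    To see why min_periods matters here, consider a rough sketch with two years of invented monthly values in which every January is NaN. The frame and column names follow the snippet above; the data itself is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical two years of monthly data with every January missing.
idx = pd.date_range('2019-01-01', periods=24, freq='MS')
data = pd.DataFrame({'variable': np.arange(24, dtype=float)}, index=idx)
data.loc[data.index.month == 1, 'variable'] = np.nan

# Default min_periods equals the window size, so any window containing a NaN
# (or running past the edge of the data) yields NaN.
strict = data['variable'].rolling(window=12, center=True).mean()

# min_periods=1 averages whatever observations are present in the window.
relaxed = data['variable'].rolling(window=12, center=True, min_periods=1).mean()

print(strict.isna().sum(), relaxed.isna().sum())
```

    Every full 12-month window contains a January, so the default setting returns NaN everywhere, while min_periods=1 averages the 11 (or fewer, near the edges) observed months.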
