Applying custom function while grouping returns NaN

Submitted by 做~自己de王妃 on 2019-12-24 17:03:37

Question


Given a dict, performance, whose values are Series like this:

2015-02-28           NaN
2015-03-02    100.000000
2015-03-03     98.997117
2015-03-04     98.909215
2015-03-05     99.909979
2015-03-06    100.161486
2015-03-09    100.502772
2015-03-10    101.685314
2015-03-11    102.518433
2015-03-12    102.427237
2015-03-13    103.424257
2015-03-16    102.669184
2015-03-17    102.181841
2015-03-18    102.436339
2015-03-19    102.672482
2015-03-20    102.238386
2015-03-23    101.460082
...

I want to group each Series by month and keep only the first value that is not np.nan in each month's data:

for perf in performance:
    performance[perf] = performance[perf].groupby(performance[perf].index.month).apply(return_first)


def return_first(array_like):
    # Return data from 1st of month, or first value that is not np.nan
    for i in range(len(array_like)):
        if np.isnan(array_like[i]):
            continue
        else:
            return(array_like[i])

This, however, returns NaN values:

2015-02-28   NaN
2015-03-02   NaN
2015-03-03   NaN
2015-03-04   NaN
2015-03-05   NaN
2015-03-06   NaN
2015-03-09   NaN
2015-03-10   NaN
2015-03-11   NaN
2015-03-12   NaN
2015-03-13   NaN
2015-03-16   NaN
2015-03-17   NaN
2015-03-18   NaN
2015-03-19   NaN
2015-03-20   NaN
2015-03-23   NaN
...

When it should have been:

2015-03-02   100   
...

I don't suspect my index, which seems to be a perfectly fine pd.DatetimeIndex:

DatetimeIndex(['2015-02-28', '2015-03-02', '2015-03-03', '2015-03-04',
           '2015-03-05', '2015-03-06', '2015-03-09', '2015-03-10',
           '2015-03-11', '2015-03-12',
           ...
           '2016-02-16', '2016-02-17', '2016-02-18', '2016-02-19',
           '2016-02-22', '2016-02-23', '2016-02-24', '2016-02-25',
           '2016-02-26', '2016-02-29'],
          dtype='datetime64[ns]', length=265, freq=None)

Where did I go wrong?


Answer 1:


If each month has at least one non-NaN value, use first_valid_index:

print (df.b.groupby(df.index.month).apply(lambda x: x[x.first_valid_index()]))

A more general solution, which returns NaN if all values in a month are NaN:

def f(x):
    if x.first_valid_index() is None:
        return np.nan
    else:
        return x[x.first_valid_index()]

print (df.b.groupby(df.index.month).apply(f))

2      NaN
3    100.0
Name: b, dtype: float64
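As an alternative to a custom function, pandas' built-in GroupBy.first() also returns the first non-null entry of each group, and yields NaN for a group that contains only NaN. The sketch below is not part of the original answer; the small Series s is made-up sample data modeled on the values in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: February is all NaN, April starts with NaN.
s = pd.Series(
    [np.nan, 100.0, 98.997117, np.nan, 100.502772],
    index=pd.to_datetime(
        ["2015-02-28", "2015-03-02", "2015-03-03", "2015-04-05", "2015-04-09"]
    ),
    name="b",
)

# GroupBy.first() skips missing values within each group.
first_per_month = s.groupby(s.index.month).first()
print(first_per_month)
```

This avoids apply altogether, which is usually faster on large inputs, but it should give the same result as f above only as long as "first" means "first in the existing row order".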

If you want to group by year and month, use to_period:

print (df.b.groupby(df.index.to_period('M')).apply(f))
2015-02      NaN
2015-03    100.0
Freq: M, Name: b, dtype: float64
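The difference matters when the data spans more than one year, as the index in the question does (2015-02 to 2016-02): grouping by index.month merges the same calendar month from different years into one group. A minimal sketch with made-up values; the helper name first_valid is hypothetical and mirrors f above:

```python
import numpy as np
import pandas as pd

# Hypothetical data covering the same months in two different years.
s = pd.Series(
    [np.nan, 100.0, 105.0, 106.0],
    index=pd.to_datetime(["2015-02-28", "2015-03-02", "2016-02-01", "2016-03-01"]),
)

def first_valid(x):
    # Return the first non-NaN value of the group, or NaN if there is none.
    idx = x.first_valid_index()
    return np.nan if idx is None else x.loc[idx]

# Month number only: February 2015 and February 2016 share one group.
by_month = s.groupby(s.index.month).apply(first_valid)

# Year and month: every calendar month gets its own group.
by_period = s.groupby(s.index.to_period("M")).apply(first_valid)

print(by_month)
print(by_period)
```

Here by_month has two rows, and the all-NaN February 2015 is hidden behind the February 2016 value, while by_period has four rows and keeps the NaN for 2015-02.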

Sample:

import pandas as pd
import numpy as np

df = pd.DataFrame({'b': pd.Series({ pd.Timestamp('2015-07-19 00:00:00'): 102.67248199999999,  pd.Timestamp('2015-04-05 00:00:00'):  np.nan,  pd.Timestamp('2015-02-25 00:00:00'):  np.nan,  pd.Timestamp('2015-04-09 00:00:00'): 100.50277199999999,  pd.Timestamp('2015-06-18 00:00:00'): 102.436339,  pd.Timestamp('2015-06-16 00:00:00'): 102.669184,  pd.Timestamp('2015-04-10 00:00:00'): 101.68531400000001,  pd.Timestamp('2015-05-12 00:00:00'): 102.42723700000001,  pd.Timestamp('2015-07-20 00:00:00'): 102.23838600000001,  pd.Timestamp('2015-06-17 00:00:00'):  np.nan,  pd.Timestamp('2015-08-23 00:00:00'): 101.460082,  pd.Timestamp('2015-03-03 00:00:00'): 98.997117000000003,  pd.Timestamp('2015-03-02 00:00:00'): 100.0,  pd.Timestamp('2015-05-11 00:00:00'): 102.518433,  pd.Timestamp('2015-03-04 00:00:00'): 98.909215000000003, pd.Timestamp('2015-05-13 00:00:00'): 103.424257,  pd.Timestamp('2015-04-06 00:00:00'):  np.nan})})
print (df)

                     b
2015-02-25         NaN
2015-03-02  100.000000
2015-03-03   98.997117
2015-03-04   98.909215
2015-04-05         NaN
2015-04-06         NaN
2015-04-09  100.502772
2015-04-10  101.685314
2015-05-11  102.518433
2015-05-12  102.427237
2015-05-13  103.424257
2015-06-16  102.669184
2015-06-17         NaN
2015-06-18  102.436339
2015-07-19  102.672482
2015-07-20  102.238386
2015-08-23  101.460082

def f(x):
    if x.first_valid_index() is None:
        return np.nan
    else:
        return x[x.first_valid_index()]

print (df.b.groupby(df.index.to_period('M')).apply(f))
2015-02           NaN
2015-03    100.000000
2015-04    100.502772
2015-05    102.518433
2015-06    102.669184
2015-07    102.672482
2015-08    101.460082
Freq: M, Name: b, dtype: float64


Source: https://stackoverflow.com/questions/37456532/applying-custom-function-while-grouping-returns-nan
