Conditionally filling blank values in Pandas dataframes

Posted by 夙愿已清 on 2019-12-20 03:03:01

Question


I have a dataframe that looks like this (other columns have been dropped):

    memberID    shipping_country    
    264991      
    264991       Canada
    100          USA    
    5000         
    5000         UK

I'm trying to fill the blank cells with the existing shipping country value for each user:

    memberID    shipping_country    
    264991       Canada
    264991       Canada
    100          USA    
    5000         UK
    5000         UK

However, I'm not sure what the most efficient way to do this is on a large-scale dataset. Perhaps a vectorized groupby method?


Answer 1:


You can use chained groupbys, one with forward fill and one with backfill:

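import numpy as np

# Replace blank values with NaN first (pd.np is no longer available in recent pandas):
df['shipping_country'] = df['shipping_country'].replace('', np.nan)

# Fill per column; recent pandas drops the key column from groupby(...).ffill() output:
df['shipping_country'] = df.groupby('memberID')['shipping_country'].ffill()
df['shipping_country'] = df.groupby('memberID')['shipping_country'].bfill()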

   memberID shipping_country
0    264991           Canada
1    264991           Canada
2       100              USA
3      5000               UK
4      5000               UK

This method will also allow a group made up of all NaN to remain NaN:

>>> df
   memberID shipping_country
0    264991                 
1    264991           Canada
2       100              USA
3      5000                 
4      5000               UK
5         1                 
6         1                 

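df['shipping_country'] = df['shipping_country'].replace('', np.nan)  # np is NumPy (import numpy as np)

df['shipping_country'] = df.groupby('memberID')['shipping_country'].ffill()
df['shipping_country'] = df.groupby('memberID')['shipping_country'].bfill()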

   memberID shipping_country
0    264991           Canada
1    264991           Canada
2       100              USA
3      5000               UK
4      5000               UK
5         1              NaN
6         1              NaN
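For reference, here is a self-contained sketch of the same forward-fill and back-fill idea. It assumes the blanks are empty strings and rebuilds the sample frame from the output above; the intermediate names `country` and `filled` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from above; blank cells are assumed to be empty strings.
df = pd.DataFrame({
    "memberID": [264991, 264991, 100, 5000, 5000, 1, 1],
    "shipping_country": ["", "Canada", "USA", "", "UK", "", ""],
})

# Treat empty strings as missing values.
country = df["shipping_country"].replace("", np.nan)

# Forward-fill, then back-fill, within each memberID group.
country = country.groupby(df["memberID"]).ffill()
country = country.groupby(df["memberID"]).bfill()

# Build a new frame rather than modifying df in place.
filled = df.assign(shipping_country=country)
print(filled)
```

Member 1 has no known country at all, so both of its rows stay NaN.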



Answer 2:


You can use GroupBy + ffill / bfill:

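def filler(x):
    # Forward-fill, then back-fill, within a single group
    return x.ffill().bfill()

# transform keeps the original index, so the result aligns with df
# (blank strings must already have been replaced with NaN)
res = df.groupby('memberID')['shipping_country'].transform(filler)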

A custom function is necessary as there's no combined Pandas method to ffill and bfill sequentially.

This also caters for the situation where all values are NaN for a specific memberID; in this case they will remain NaN.
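A minimal runnable sketch of this approach, assuming the blanks are already NaN. The rows are deliberately interleaved (my own rearrangement of the sample data) to show that the result still lines up with the original index, and `transform` is used so the output can be assigned straight back as a column.

```python
import numpy as np
import pandas as pd

def filler(x):
    # Forward-fill, then back-fill, inside one group.
    return x.ffill().bfill()

# Rows for the same member are not adjacent here on purpose.
df = pd.DataFrame({
    "memberID": [5000, 100, 5000, 264991, 264991, 1],
    "shipping_country": [np.nan, "USA", "UK", "Canada", np.nan, np.nan],
})

# transform returns a Series aligned to df's original index.
df["shipping_country"] = df.groupby("memberID")["shipping_country"].transform(filler)
print(df)
```

Row 0 is back-filled from row 2, row 4 is forward-filled from row 3, and the single row for member 1 remains NaN.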




Answer 3:


For the following sample dataframe (I added a memberID group that only contains '' in the shipping_country column):

   memberID shipping_country
0    264991                 
1    264991           Canada
2       100              USA
3      5000                 
4      5000               UK
5        54                 

This should work for you. It also has the property that if a memberID group contains only empty strings ('') in shipping_country, those are retained in the output df:

df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first').fillna('')

Yields:

   memberID shipping_country
0    264991           Canada
1    264991           Canada
2       100              USA
3      5000               UK
4      5000               UK
5        54                 

If you would rather have those empty strings become NaN in the output df, just remove the fillna(''), leaving:

df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first')
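
One thing worth checking before using transform('first'): it writes the first non-null value of each group into every row of that group, so a member with two different countries on record would have the second one overwritten. The sketch below uses hypothetical data (member 7 with both Canada and UK) and shows a variant that fills only the blanks; the names `first_known`, `overwritten`, and `preserved` are just for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: member 7 has two different known countries.
df = pd.DataFrame({
    "memberID": [7, 7, 7, 8],
    "shipping_country": ["Canada", "", "UK", ""],
})

country = df["shipping_country"].replace("", np.nan)

# First non-null country per member, broadcast to every row of the group.
first_known = country.groupby(df["memberID"]).transform("first")

# Same result as the one-liner above: every row takes the group's first value.
overwritten = first_known.fillna("")

# Variant: keep existing values and fill only the missing ones.
preserved = country.fillna(first_known).fillna("")

print(overwritten.tolist())
print(preserved.tolist())
```

If each member only ever has one country, both variants give the same result and the one-liner is sufficient.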


Source: https://stackoverflow.com/questions/52781993/conditionally-filling-blank-values-in-pandas-dataframes
