Conditionally filling blank values in Pandas dataframes

Question

I have a dataframe that looks as follows (other columns have been dropped):

    memberID    shipping_country    
    264991      
    264991       Canada
    100          USA    
    5000         
    5000         UK

I'm trying to fill the blank cells with the existing shipping country value for each member:

    memberID    shipping_country    
    264991       Canada
    264991       Canada
    100          USA    
    5000         UK
    5000         UK

However, I'm not sure what the most efficient way to do this is on a large-scale dataset. Perhaps a vectorized groupby method?

Answer 1:

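You can use two grouped fills, one forward fill and one backfill:

import numpy as np

# replace blank values with `NaN` first:
df['shipping_country'] = df['shipping_country'].replace('', np.nan)

# fill within each memberID group, forward and then backward:
df['shipping_country'] = df.groupby('memberID')['shipping_country'].ffill()
df['shipping_country'] = df.groupby('memberID')['shipping_country'].bfill()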

   memberID shipping_country
0    264991           Canada
1    264991           Canada
2       100              USA
3      5000               UK
4      5000               UK

This method will also allow a group made up of all NaN to remain NaN:

>>> df
   memberID shipping_country
0    264991                 
1    264991           Canada
2       100              USA
3      5000                 
4      5000               UK
5         1                 
6         1                 

df['shipping_country'] = df['shipping_country'].replace('', np.nan)

df['shipping_country'] = df.groupby('memberID')['shipping_country'].ffill()
df['shipping_country'] = df.groupby('memberID')['shipping_country'].bfill()

   memberID shipping_country
0    264991           Canada
1    264991           Canada
2       100              USA
3      5000               UK
4      5000               UK
5         1              NaN
6         1              NaN
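
For a self-contained check of the grouped fill, here is a minimal sketch. The construction of `df` is an assumption (the original data was not posted as code); it mirrors the seven-row table above. The column is selected before grouping so that the result stays aligned with the original row index.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the tables above (assumed construction).
df = pd.DataFrame({
    'memberID': [264991, 264991, 100, 5000, 5000, 1, 1],
    'shipping_country': ['', 'Canada', 'USA', '', 'UK', '', ''],
})

# Blank strings become NaN so that ffill/bfill treat them as missing.
country = df['shipping_country'].replace('', np.nan)

# Forward fill, then backfill, each restricted to rows sharing a memberID.
country = country.groupby(df['memberID']).ffill()
country = country.groupby(df['memberID']).bfill()

df['shipping_country'] = country
print(df)
```

Because each fill is done per group, a value never leaks from one memberID into another, and a group with no country at all stays NaN.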

Answer 2:

You can use GroupBy + ffill / bfill, once the blank strings have been replaced with NaN:

def filler(x):
    return x.ffill().bfill()

# transform keeps the original row index, so res lines up with df
res = df.groupby('memberID')['shipping_country'].transform(filler)

A custom function is necessary as there's no combined Pandas method to ffill and bfill sequentially.

This also caters for the situation where all values are NaN for a specific memberID; in this case they will remain NaN.
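
A minimal runnable sketch of this approach, assuming the blanks have already been converted to NaN. The sample frame is reconstructed from the question and is an assumption. The call goes through `transform`, which returns a result aligned with the original row index and can be assigned straight back to the column.

```python
import numpy as np
import pandas as pd

# Sample data with blanks already represented as NaN (assumed construction).
df = pd.DataFrame({
    'memberID': [264991, 264991, 100, 5000, 5000, 1, 1],
    'shipping_country': [np.nan, 'Canada', 'USA', np.nan, 'UK', np.nan, np.nan],
})

def filler(x):
    # x is the shipping_country Series of a single memberID group
    return x.ffill().bfill()

df['shipping_country'] = (
    df.groupby('memberID')['shipping_country'].transform(filler)
)
print(df)
```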

Answer 3:

For the following sample dataframe (I added a memberID group that only contains '' in the shipping_country column):

   memberID shipping_country
0    264991                 
1    264991           Canada
2       100              USA
3      5000                 
4      5000               UK
5        54

This should work for you. It also ensures that if a memberID group contains only empty strings ('') in shipping_country, those are retained in the output df:

df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first').fillna('')

Yields:

   memberID shipping_country
0    264991           Canada
1    264991           Canada
2       100              USA
3      5000               UK
4      5000               UK
5        54

If you would rather have those empty strings appear as NaN in the output df, just remove the fillna(''), leaving:

df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first')
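
One thing to keep in mind: `transform('first')` broadcasts the first non-null value of each group to every row of that group, so it also overwrites non-blank values when a member has more than one distinct country. The sketch below uses made-up data (the frame and the names `first` and `kept` are hypothetical, not from the question) to show the difference, along with one possible way to fill only the blanks.

```python
import numpy as np
import pandas as pd

# Hypothetical data: member 7 has two different non-blank countries.
df = pd.DataFrame({
    'memberID': [7, 7, 7, 54],
    'shipping_country': ['', 'Canada', 'USA', ''],
})

country = df['shipping_country'].replace('', np.nan)

# First non-null value per member, repeated on every row of that member.
first = country.groupby(df['memberID']).transform('first')

# Fill only the missing cells and keep existing values untouched.
kept = country.fillna(first)

print(first.tolist())
print(kept.tolist())
```

Here `first` replaces the 'USA' row of member 7 with 'Canada', while `kept` leaves it alone; member 54, which has no country at all, stays NaN in both.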

Source: https://stackoverflow.com/questions/52781993/conditionally-filling-blank-values-in-pandas-dataframes

Tags

python

pandas

dataframe

pandas-groupby

series