Change values in pandas dataframe according to value_counts()

执念已碎 2020-12-21 02:02

I have the following pandas DataFrame:

import pandas as pd 
from pandas import Series, DataFrame

data = DataFrame({'Qu1': ['apple', 'potato', 'cheese',


        
2 answers
  • 2020-12-21 02:50

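    You could first count how often each value occurs in each column (df here stands for the DataFrame from the question):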

    value_counts = df.apply(lambda x: x.value_counts())
    
             Qu1  Qu2  Qu3
    apple    1.0  3.0  1.0
    banana   2.0  4.0  NaN
    cheese   3.0  NaN  3.0
    egg      1.0  NaN  1.0
    potato   2.0  NaN  3.0
    sausage  NaN  2.0  1.0
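
The question's code is truncated, so the counting step is easier to follow with a complete frame. The sketch below is a minimal reproduction, assuming sample data reconstructed from the outputs shown in the answers. The exact row order of the rare values in Qu3 is a guess and not necessarily the asker's original data.

```python
import pandas as pd

# Assumed sample data, rebuilt from the outputs shown in the answers.
df = pd.DataFrame({
    "Qu1": ["apple", "potato", "cheese", "banana", "cheese",
            "banana", "cheese", "potato", "egg"],
    "Qu2": ["sausage", "banana", "apple", "apple", "apple",
            "sausage", "banana", "banana", "banana"],
    "Qu3": ["apple", "potato", "sausage", "cheese", "cheese",
            "potato", "cheese", "potato", "egg"],
})

# One column of counts per original column. The index is the union of all
# values, so a value that never occurs in a column shows up as NaN there.
value_counts = df.apply(lambda col: col.value_counts())
print(value_counts)
```

Because absent values become NaN, the counts are stored as floats, which is why the table shows 1.0 rather than 1.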
    

    Then build a dictionary that will contain the replacements for each column:

    from itertools import cycle

    replacements = {}
    for col, s in value_counts.items():
        # Map every value that occurs fewer than 2 times in this column to 'other'.
        if (s < 2).any():
            replacements[col] = dict(zip(s[s < 2].index.tolist(), cycle(['other'])))
    
    replacements
    {'Qu1': {'egg': 'other', 'apple': 'other'}, 'Qu3': {'egg': 'other', 'apple': 'other', 'sausage': 'other'}}
    

    Use the dictionary to replace the values:

    df.replace(replacements)
    
          Qu1      Qu2     Qu3
    0   other  sausage   other
    1  potato   banana  potato
    2  cheese    apple   other
    3  banana    apple  cheese
    4  cheese    apple  cheese
    5  banana  sausage  potato
    6  cheese   banana  cheese
    7  potato   banana  potato
    8   other   banana   other
    

    Or wrap the loop in a dictionary comprehension:

    from itertools import cycle
    
    df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})
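
As a possible simplification, the zip/cycle pairing can be swapped for dict.fromkeys, which maps every key to the same value. This is an alternative sketch rather than the answer's own code, and the sample frame is assumed data reconstructed from the outputs shown above.

```python
import pandas as pd

# Assumed sample data, rebuilt from the outputs shown in the answers.
df = pd.DataFrame({
    "Qu1": ["apple", "potato", "cheese", "banana", "cheese",
            "banana", "cheese", "potato", "egg"],
    "Qu2": ["sausage", "banana", "apple", "apple", "apple",
            "sausage", "banana", "banana", "banana"],
    "Qu3": ["apple", "potato", "sausage", "cheese", "cheese",
            "potato", "cheese", "potato", "egg"],
})

value_counts = df.apply(lambda col: col.value_counts())

# Per column: {rare_value: "other", ...}. Columns without rare values are skipped.
replacements = {
    col: dict.fromkeys(counts[counts < 2].index, "other")
    for col, counts in value_counts.items()
    if (counts < 2).any()
}

# A nested dict tells replace() which values to swap in which column.
result = df.replace(replacements)
print(result)
```

NaN counts compare as False against the threshold, so values that are absent from a column never end up in its mapping.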
    

    However, this is not only more cumbersome but also slower than using .where. Testing with 3,000 columns:

    df = pd.concat([df for i in range(1000)], axis=1)
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 9 entries, 0 to 8
    Columns: 3000 entries, Qu1 to Qu3
    dtypes: object(3000)
    

    Using .replace():

    %%timeit
    value_counts = df.apply(lambda x: x.value_counts())
    df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})
    
    1 loop, best of 3: 4.97 s per loop
    

    vs .where():

    %%timeit
    df.where(df.apply(lambda x: x.map(x.value_counts()))>=2, "other")
    
    1 loop, best of 3: 2.01 s per loop
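
The %%timeit cells only run inside IPython. A rough equivalent using the standard timeit module might look like the sketch below. The function names with_replace and with_where are hypothetical, the sample frame is assumed data reconstructed from the outputs above, and the frame is widened with unique column names (a deviation from the concat shown above, to avoid duplicate labels). Timings depend on the machine and pandas version, so none are quoted here.

```python
import timeit
from itertools import cycle

import pandas as pd

# Assumed sample data, rebuilt from the outputs shown in the answers.
base = pd.DataFrame({
    "Qu1": ["apple", "potato", "cheese", "banana", "cheese",
            "banana", "cheese", "potato", "egg"],
    "Qu2": ["sausage", "banana", "apple", "apple", "apple",
            "sausage", "banana", "banana", "banana"],
    "Qu3": ["apple", "potato", "sausage", "cheese", "cheese",
            "potato", "cheese", "potato", "egg"],
})

# 50 copies side by side (150 columns), each copy with its own suffix.
wide = pd.concat([base.add_suffix(f"_{i}") for i in range(50)], axis=1)


def with_replace(frame):
    """Build a per-column mapping of rare values, then call replace()."""
    counts = frame.apply(lambda col: col.value_counts())
    mapping = {
        col: dict(zip(s[s < 2].index.tolist(), cycle(["other"])))
        for col, s in counts.items()
        if (s < 2).any()
    }
    return frame.replace(mapping)


def with_where(frame):
    """Map each entry to its count, then keep only entries seen at least twice."""
    counts = frame.apply(lambda col: col.map(col.value_counts()))
    return frame.where(counts >= 2, "other")


# Both approaches should agree before their speed is compared.
same = with_replace(wide).equals(with_where(wide))

for fn in (with_replace, with_where):
    seconds = timeit.timeit(lambda: fn(wide), number=3)
    print(f"{fn.__name__}: {seconds / 3:.3f} s per call")
```

Raising the number of copies toward 1000 brings the test closer to the 3,000-column case measured above.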
    
  • 2020-12-21 03:01

    I would create a DataFrame of the same shape in which each entry is the number of times the corresponding value occurs in its column:

    data.apply(lambda x: x.map(x.value_counts()))
    Out[229]: 
       Qu1  Qu2  Qu3
    0    1    2    1
    1    2    4    3
    2    3    3    1
    3    2    3    3
    4    3    3    3
    5    2    2    3
    6    3    4    3
    7    2    4    3
    8    1    4    1
    

    Then use the result in df.where to keep entries whose count is at least 2 and return "other" for the rest:

    data.where(data.apply(lambda x: x.map(x.value_counts()))>=2, "other")
    
          Qu1      Qu2     Qu3
    0   other  sausage   other
    1  potato   banana  potato
    2  cheese    apple   other
    3  banana    apple  cheese
    4  cheese    apple  cheese
    5  banana  sausage  potato
    6  cheese   banana  cheese
    7  potato   banana  potato
    8   other   banana   other
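
The one-liner above can be wrapped in a small reusable function. The sketch below is one way to do that. The name lump_rare and its parameters min_count and label are hypothetical, and keeping missing values as missing is a design choice added here, not part of the original answer. Without that extra mask, NaN entries have no count and would be relabeled as "other".

```python
import pandas as pd


def lump_rare(frame, min_count=2, label="other"):
    """Replace values seen fewer than min_count times in their column with label."""
    # Same shape as frame: each entry is the count of that value in its column.
    counts = frame.apply(lambda col: col.map(col.value_counts()))
    # Missing values have no count (NaN); keep them instead of relabeling them.
    keep = (counts >= min_count) | frame.isna()
    return frame.where(keep, label)


# Assumed sample data, rebuilt from the outputs shown in the answers.
data = pd.DataFrame({
    "Qu1": ["apple", "potato", "cheese", "banana", "cheese",
            "banana", "cheese", "potato", "egg"],
    "Qu2": ["sausage", "banana", "apple", "apple", "apple",
            "sausage", "banana", "banana", "banana"],
    "Qu3": ["apple", "potato", "sausage", "cheese", "cheese",
            "potato", "cheese", "potato", "egg"],
})

result = lump_rare(data)
print(result)
```

A higher threshold such as lump_rare(data, min_count=3) would also fold values that occur exactly twice into the label.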
    