Remove low-frequency values from pandas.DataFrame

忘掉有多难 2020-12-13 14:44

How can I remove values that occur rarely (i.e. with a low frequency) from a column in a pandas.DataFrame? Example:

In [4]: df[col_1].value_counts()


        
2 Answers
  • 2020-12-13 15:06

    I see there are two ways you might want to do this.

    For the entire DataFrame

    This approach counts each value across the entire DataFrame and replaces the infrequent ones with NaN. It needs no explicit loops; pandas' built-in vectorized functions do the work.

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame(np.random.randint(0, high=9, size=(100,2)),
             columns = ['A', 'B'])
    
    threshold = 10 # Values that occur this many times or fewer are replaced with NaN.
    value_counts = df.stack().value_counts() # Entire DataFrame 
    to_remove = value_counts[value_counts <= threshold].index
    df.replace(to_remove, np.nan, inplace=True)
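
    The same whole-DataFrame idea can also be written without mutating df. The following is only a sketch of one possible variant: it uses a small hand-made frame so the counts can be checked by eye, and the names rare and masked are illustrative assumptions. DataFrame.where keeps the cells where the condition holds and fills the rest with NaN.

```python
import pandas as pd

# Small hand-made frame so the counts are easy to verify by eye.
df = pd.DataFrame({"A": [1, 1, 1, 2, 3], "B": [1, 2, 2, 2, 4]})

threshold = 1  # assumption: values seen this many times or fewer are masked

# Count every value across all columns at once.
counts = df.stack().value_counts()
rare = counts[counts <= threshold].index.tolist()

# Keep cells whose value is not rare; all other cells become NaN.
masked = df.where(~df.isin(rare))
print(masked)
```

    Here 3 and 4 each occur only once in the whole frame, so only those two cells become NaN, and df itself is left unchanged.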
    

    Column-by-column

    This approach counts values separately in each column and replaces those that are infrequent within that column with NaN.

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame(np.random.randint(0, high=9, size=(100,2)),
             columns = ['A', 'B'])
    
    threshold = 10 # Values that occur this many times or fewer are replaced with NaN.
    for col in df.columns:
        value_counts = df[col].value_counts() # Specific column 
        to_remove = value_counts[value_counts <= threshold].index
        # Assign back: an in-place replace on df[col] may act on a temporary copy.
        df[col] = df[col].replace(to_remove, np.nan)
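
    The per-column logic can also be expressed without an explicit loop. This is a minimal sketch, assuming a hypothetical helper named mask_rare (not part of pandas): it maps each value to its count within the column and keeps only the values whose count exceeds the threshold.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [5, 5, 5, 6, 7]})

threshold = 1  # assumption: values seen this many times or fewer are masked


def mask_rare(col: pd.Series, threshold: int) -> pd.Series:
    """Hypothetical helper: replace rare values in one column with NaN."""
    # Map each entry to the number of times its value occurs in this column.
    counts = col.map(col.value_counts())
    # Keep entries whose value is frequent enough; the rest become NaN.
    return col.where(counts > threshold)


# DataFrame.apply calls the helper once per column and forwards the keyword.
masked = df.apply(mask_rare, threshold=threshold)
print(masked)
```

    In column A only the value 3 is rare, while in column B both 6 and 7 are, so each column is masked according to its own counts.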
    
  • 2020-12-13 15:09

    You probably don't want to remove the entire row in your DataFrame if only one column has values below your threshold, so I've simply removed these data points and replaced them with None.

    I loop through each column and run value_counts on it. I then take the index labels of the items that occur at or below the target threshold. Finally, I use .loc to locate those values in the column and replace them with None.

    import pandas as pd
    
    df = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'], 
                       'B': ['a', 'a', 'b', 'c', 'c'], 
                       'C': ['a', 'a', 'b', 'b', 'c']})
    
    >>> df
       A  B  C
    0  a  a  a
    1  b  a  a
    2  b  b  b
    3  c  c  b
    4  c  c  c
    
    threshold = 1  # Replace values that occur this many times or fewer
    for col in df:
        vc = df[col].value_counts()
        vals_to_remove = vc[vc <= threshold].index.values
        # Index the DataFrame directly; df[col].loc[...] = ... is chained assignment.
        df.loc[df[col].isin(vals_to_remove), col] = None
    
    >>> df
          A     B     C
    0  None     a     a
    1     b     a     a
    2     b  None     b
    3     c     c     b
    4     c     c  None
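
    If dropping whole rows is actually what you want for one particular column, a per-row frequency can be used as a filter. This is only a sketch under that assumption; the choice of column 'A' and the names freq and filtered are illustrative, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'],
                   'B': ['a', 'a', 'b', 'c', 'c'],
                   'C': ['a', 'a', 'b', 'b', 'c']})

threshold = 1
col = 'A'  # assumption: rows are filtered on this single column

# For every row, look up how often that row's value occurs in the column.
freq = df.groupby(col)[col].transform('size')

# Keep only the rows whose value occurs more often than the threshold.
filtered = df[freq > threshold]
print(filtered)
```

    With this data only row 0 is dropped, because 'a' occurs once in column A; the other columns play no part in the filter.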
    