Python pandas: exclude rows below a certain frequency count

面向向阳花 2020-12-08 05:29

So I have a pandas DataFrame that looks like this:

r vals    positions
1.2       1
1.8       2
2.3       1
1.8       1
2.1       3
2.0       3
1.9       1
..         


        
3 Answers
  • 2020-12-08 05:54

    How about selecting all rows whose positions value is >= 20?

    mask = df['positions'] >= 20
    sel = df.loc[mask, :]
    
  • 2020-12-08 06:11

    On your limited dataset the following works:

    In [125]:
    df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)
    
    Out[125]:
    0    1.2
    2    2.3
    3    1.8
    6    1.9
    Name: r vals, dtype: float64
    

    You can assign the result of this filter and use its index to select the matching rows from your original df. (Matching on values with df['r vals'].isin(filtered) would also keep row 1, because its value 1.8 also occurs at position 1.)

    In [129]:
    filtered = df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)
    df.loc[filtered.index]
    
    Out[129]:
       r vals  positions
    0     1.2          1
    2     2.3          1
    3     1.8          1
    6     1.9          1
    

    You just need to change 3 to 20 in your case.

    Another approach would be to use value_counts to create an aggregate Series of counts, which we can then use to filter your df:

    In [136]:
    counts = df['positions'].value_counts()
    counts
    
    Out[136]:
    1    4
    3    2
    2    1
    dtype: int64
    
    In [137]:
    counts[counts >= 3]
    
    Out[137]:
    1    4
    dtype: int64
    
    In [135]:
    df[df['positions'].isin(counts[counts >= 3].index)]
    
    Out[135]:
       r vals  positions
    0     1.2          1
    2     2.3          1
    3     1.8          1
    6     1.9          1
    

    EDIT

    If you want to filter the whole DataFrame rather than a single column (Series), you can call filter on the groupby object directly:

    In [139]:
    filtered = df.groupby('positions').filter(lambda x: len(x) >= 3)
    filtered
    
    Out[139]:
       r vals  positions
    0     1.2          1
    2     2.3          1
    3     1.8          1
    6     1.9          1
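
As a self-contained sketch of the two approaches above, the snippet below rebuilds the sample frame and checks that value_counts with isin and groupby(...).filter keep the same rows. It assumes only the seven rows shown in the question (the rest of the data is elided there), and `min_count`, `by_counts` and `by_filter` are illustrative names, not part of the original answer.

```python
import pandas as pd

# Sample data: only the seven rows shown in the question (the rest is elided there).
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

min_count = 3  # illustrative threshold; the question's case would use 20

# Approach 1: count each position, then keep rows whose position is frequent enough.
counts = df["positions"].value_counts()
by_counts = df[df["positions"].isin(counts[counts >= min_count].index)]

# Approach 2: let groupby drop whole groups that are too small.
by_filter = df.groupby("positions").filter(lambda g: len(g) >= min_count)

# Both selections should contain the same rows in the same order.
print(by_counts.equals(by_filter))
```

Both variants select on the group key (positions) rather than on the measured values, so duplicate values in 'r vals' cannot pull in rows from small groups.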
    
  • 2020-12-08 06:11

    I like the following method:

    import pandas as pd


    def filter_by_freq(df: pd.DataFrame, column: str, min_freq: int) -> pd.DataFrame:
        """Filters the DataFrame based on the value frequency in the specified column.
    
        :param df: DataFrame to be filtered.
        :param column: Column name that should be frequency filtered.
        :param min_freq: Minimal value frequency for the row to be accepted.
        :return: Frequency filtered DataFrame.
        """
        # Frequencies of each value in the column.
        freq = df[column].value_counts()
        # Select frequent values. Value is in the index.
        frequent_values = freq[freq >= min_freq].index
        # Return only rows whose value frequency is at or above the threshold.
        return df[df[column].isin(frequent_values)]
    

    It is much faster than the filter lambda method in the accepted answer, because it avoids calling a Python lambda once per group.
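
One way to check the speed claim on your own data is a rough timing comparison like the sketch below. The synthetic frame, the threshold `min_freq = 10` and the helper names `with_value_counts` and `with_groupby_filter` are assumptions made for illustration; actual timings depend on the number of groups, the pandas version and the machine, so no figures are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): 20,000 rows, up to 2,000 distinct positions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "r vals": rng.random(20_000),
    "positions": rng.integers(0, 2_000, size=20_000),
})
min_freq = 10  # hypothetical threshold


def with_value_counts(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose position occurs at least min_freq times, via value_counts + isin."""
    freq = frame["positions"].value_counts()
    return frame[frame["positions"].isin(freq[freq >= min_freq].index)]


def with_groupby_filter(frame: pd.DataFrame) -> pd.DataFrame:
    """Same selection, but calling a Python lambda once per group."""
    return frame.groupby("positions").filter(lambda g: len(g) >= min_freq)


# Sanity check: both methods should select exactly the same rows.
same = with_value_counts(df).equals(with_groupby_filter(df))

# Time a few repetitions of each method.
t_counts = timeit.timeit(lambda: with_value_counts(df), number=3)
t_filter = timeit.timeit(lambda: with_groupby_filter(df), number=3)

print(f"same result: {same}")
print(f"value_counts + isin: {t_counts:.4f}s, groupby.filter: {t_filter:.4f}s")
```

The gap between the two would be expected to widen as the number of distinct groups grows, since the lambda is invoked once per group.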
