Python: Removing Rows on Count condition

攒了一身酷 2020-12-09 10:07

I have a problem filtering a pandas dataframe.

city 
NYC 
NYC 
NYC 
NYC 
SYD 
SYD 
SEL 
SEL
...

df.city.value_counts()

I woul

4 Answers
  • 2020-12-09 10:20

    Here you go, using filter:

    df.groupby('city').filter(lambda x : len(x)>3)
    Out[1743]: 
      city
    0  NYC
    1  NYC
    2  NYC
    3  NYC
    

    Solution two, using transform:

    sub_df = df[df.groupby('city').city.transform('count') > 3].copy()
    # .copy() avoids a SettingWithCopyWarning if you later modify sub_df
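
A quick way to convince yourself that the two approaches select the same rows is to run both on a small frame and compare. This is only a sketch: the sample data is made up to mirror the question's layout, and the names by_filter, by_transform and group_sizes are illustrative.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's layout
df = pd.DataFrame({'city': ['NYC'] * 4 + ['SYD'] * 2 + ['SEL'] * 2})

# Approach 1: keep whole groups that have more than 3 rows
by_filter = df.groupby('city').filter(lambda g: len(g) > 3)

# Approach 2: broadcast each group's size back onto its rows, then mask
group_sizes = df.groupby('city')['city'].transform('count')
by_transform = df[group_sizes > 3].copy()

# Both results should contain the same rows with the same index
print(by_filter.equals(by_transform))
```

The transform version tends to be faster on large frames because it avoids calling a Python lambda once per group, though filter reads more directly.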
    
  • 2020-12-09 10:31

    Another solution:

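    threshold = 3
    # Attach each row's group size; 'count' is the built-in group aggregation
    df['Count'] = df.groupby('City')['City'].transform('count')
    # Keep rows in groups that meet the threshold, then drop the helper column
    df = df[df['Count'] >= threshold].drop(columns='Count')
    print(df)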
    
      City
    0  NYC
    1  NYC
    2  NYC
    3  NYC
    
  • 2020-12-09 10:32

    I think you're looking for value_counts()

    # Import the great and powerful pandas
    import pandas as pd
    
    # Create some example data
    df = pd.DataFrame({
        'city': ['NYC', 'NYC', 'SYD', 'NYC', 'SEL', 'NYC', 'NYC']
    })
    
    # Get the count of each value
    value_counts = df['city'].value_counts()
    
    # Select the values whose count is 3 or fewer (adjust the threshold as needed)
    to_remove = value_counts[value_counts <= 3].index
    
    # Keep rows where the city column is not in to_remove
    df = df[~df.city.isin(to_remove)]
    
  • 2020-12-09 10:37

    This is one way using pd.Series.value_counts.

    counts = df['city'].value_counts()
    
    res = df[~df['city'].isin(counts[counts < 5].index)]
    

    counts is a pd.Series object, and counts < 5 returns a Boolean series. Indexing counts with that Boolean series (that's what the square brackets do) keeps only the entries below 5. The index of the result holds the cities with fewer than 5 occurrences, and ~ negates the isin mask element-wise, so rows for those cities are dropped.

    Remember a series is a mapping between index and value. The index of a series does not necessarily contain unique values, but this is guaranteed with the output of value_counts.
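
To make the steps above concrete, here is a small sketch that names each intermediate result. The sample data and the names rare_mask and rare_cities are assumptions for illustration; the threshold of 5 follows the answer above.

```python
import pandas as pd

# Hypothetical sample data; the threshold of 5 follows the answer above
df = pd.DataFrame({'city': ['NYC'] * 5 + ['SYD'] * 2 + ['SEL'] * 2})

counts = df['city'].value_counts()     # Series mapping city -> number of rows
rare_mask = counts < 5                 # Boolean Series on the same index
rare_cities = counts[rare_mask].index  # Index of cities below the threshold

# Keep only rows whose city is NOT among the rare ones
res = df[~df['city'].isin(rare_cities)]
print(res)
```

Printing counts, rare_mask and rare_cities one at a time is a handy way to debug the threshold before applying it to the full frame.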
