I'm trying to remove entries from a data frame that occur fewer than 100 times.
The data frame data looks like this:
pid tag
1 23
1
Here are run times for two of the solutions posted here, along with one that was not posted (using value_counts()) and that is much faster than the others:
import pandas as pd
import numpy as np
# Generate some 'users'
np.random.seed(42)
df = pd.DataFrame({'uid': np.random.randint(0, 500, 500)})
# Show that some users occur only once
print("{:,} users only occur once in dataset".format(sum(df.uid.value_counts() == 1)))
171 users only occur once in dataset
%%timeit
df.groupby(by='uid').filter(lambda x: len(x) > 1)
%%timeit
df[df.groupby('uid').uid.transform(len) > 1]
%%timeit
vc = df.uid.value_counts()
df[df.uid.isin(vc.index[vc.values > 1])].uid.value_counts()
groupby + filter:     10 loops, best of 3: 46.2 ms per loop
groupby + transform:  10 loops, best of 3: 30.1 ms per loop
value_counts + isin:  1000 loops, best of 3: 1.27 ms per loop
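Applied to the original question, the value_counts() approach could look roughly like the sketch below. The helper name drop_rare, the sample data, and the choice of "pid" as the column to count are assumptions made for illustration. The question's real data is not shown in full, so adjust the column name and threshold to your case.

```python
import numpy as np
import pandas as pd


def drop_rare(df, column, min_count):
    """Keep only rows whose value in `column` occurs at least `min_count` times.

    Hypothetical helper, not part of pandas.
    """
    counts = df[column].value_counts()
    # Values that meet the threshold
    frequent = counts[counts >= min_count].index
    return df[df[column].isin(frequent)]


# Made-up sample data: pid 1 occurs 150 times, pid 2 99 times, pid 3 100 times
data = pd.DataFrame({
    "pid": np.repeat([1, 2, 3], [150, 99, 100]),
    "tag": np.arange(349),
})

# Drop every pid that occurs fewer than 100 times
filtered = drop_rare(data, "pid", 100)
print(filtered["pid"].value_counts())
```

Only pid 2 falls below the threshold here, so its 99 rows are dropped and the rows for pids 1 and 3 are kept. The comparison is >= min_count because the question asks to remove entries occurring fewer than 100 times; use > instead if the cutoff should be exclusive.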