Question
I am looking for a way to remove rows that contain low-frequency items from a DataFrame. I adapted the following snippet from this post:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])
threshold = 10  # Values that occur this many times or fewer will be removed.
value_counts = df.stack().value_counts() # Entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
The problem is that this code does not seem to scale.
The line to_remove = value_counts[value_counts <= threshold].index has now been running for several hours on my data (a 2 GB compressed HDFStore), so I need a better solution, ideally out-of-core. I suspect dask.dataframe is suitable, but I am failing to express the above code in terms of dask: the key functions stack and replace are absent from dask.dataframe.
To work around the lack of these two functions, I tried the following (it works in plain pandas):
value_countss = [df[col].value_counts() for col in df.columns]
infrequent_itemss = [value_counts[value_counts < 3] for value_counts in value_countss]
rows_to_drop = set(
    i
    for indices in [
        df.loc[df[col].isin(infrequent_items.keys())].index.values
        for col, infrequent_items in zip(df.columns, infrequent_itemss)
    ]
    for i in indices
)
df.drop(rows_to_drop)
That does not actually work with dask, though: it errors at infrequent_items.keys().
Even if it did work, given that this is the opposite of elegant, I suspect there must be a better way.
Can you suggest something?
Answer 1:
Not sure if this will help you out, but it's too big for a comment:
df = pd.DataFrame(np.random.randint(0, high=20, size=(30, 2)), columns=['A', 'B'])

# Count every value across the whole frame in a single numpy pass.
unique, counts = np.unique(df.values.ravel(), return_counts=True)
d = dict(zip(unique, counts))

threshold = 10
to_remove = [k for k, v in d.items() if v < threshold]
df.replace(to_remove, np.nan, inplace=True)
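Note that replace only blanks the offending cells with NaN; if the rows themselves should go, as the question asks, a dropna pass afterwards would finish the job (a small follow-up sketch, not part of the original snippet):
df = df.dropna()  # discard every row that picked up a NaN above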
See:
How to count the occurrence of certain item in an ndarray in Python?
how to count occurrence of each unique value in pandas
On a toy problem, this showed a 40x speedup in the step you mentioned, from 400 µs down to 10 µs.
Answer 2:
The following code, which incorporates Evan's improvement, solves my issue:
# Count every value across the whole frame in one pass.
unique, counts = np.unique(df.values.ravel(), return_counts=True)
d = dict(zip(unique, counts))
to_remove = {k for k, v in d.items() if v < threshold}

# Keep only the rows in which no cell holds an infrequent value.
mask = df.isin(to_remove)
row_mask = (~mask).all(axis=1)
df = df[row_mask]
Demo:
def filter_low_frequency(df, threshold=4):
    # Count every value across the whole frame in one pass.
    unique, counts = np.unique(df.values.ravel(), return_counts=True)
    d = dict(zip(unique, counts))
    to_remove = {k for k, v in d.items() if v < threshold}
    # Keep only the rows in which no cell holds an infrequent value.
    mask = df.isin(to_remove)
    row_mask = (~mask).all(axis=1)
    return df[row_mask]
df = pd.DataFrame(np.random.randint(0, high=20, size=(10,10)))
print(df)
print(df.stack().value_counts())
df = filter_low_frequency(df)
print(df)
0 1 2 3 4 5 6 7 8 9
0 3 17 11 13 8 8 15 14 7 8
1 2 14 11 3 16 10 19 19 14 4
2 8 13 13 17 3 13 17 18 5 18
3 7 8 14 9 15 12 0 15 2 19
4 6 12 13 11 16 6 19 16 2 17
5 2 1 2 17 1 3 12 10 2 16
6 0 19 9 4 15 3 3 3 4 0
7 18 8 15 9 1 18 15 17 9 0
8 17 15 9 11 13 9 11 4 19 8
9 13 6 7 8 8 10 0 3 16 13
8 9
3 8
13 8
17 7
15 7
19 6
2 6
9 6
11 5
16 5
0 5
18 4
4 4
14 4
10 3
12 3
7 3
6 3
1 3
5 1
dtype: int64
0 1 2 3 4 5 6 7 8 9
6 0 19 9 4 15 3 3 3 4 0
8 17 15 9 11 13 9 11 4 19 8
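For the out-of-core case the question actually asks about, the same count/mask/filter idea can be sketched with dask.dataframe: value_counts is available per column on dask Series, and isin plus a boolean row mask evaluate lazily. A minimal, untested sketch under those assumptions (the small table of distinct-value counts is materialized eagerly; npartitions=4 is an arbitrary illustration value):
import dask.dataframe as dd
import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.randint(0, high=20, size=(30, 2)), columns=['A', 'B'])
ddf = dd.from_pandas(pdf, npartitions=4)

threshold = 10
# Sum per-column value counts into one global counts Series (still lazy).
counts = None
for col in ddf.columns:
    vc = ddf[col].value_counts()
    counts = vc if counts is None else counts.add(vc, fill_value=0)
counts = counts.compute()  # the distinct-value table is small, so materialize it

to_remove = counts[counts < threshold].index.tolist()
row_mask = ~ddf.isin(to_remove).any(axis=1)  # True where no cell is infrequent
result = ddf[row_mask].compute()
In a real out-of-core setting one would build ddf with dd.read_hdf or dd.read_parquet rather than from_pandas, and keep the final result lazy until it is needed.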
Source: https://stackoverflow.com/questions/53605193/dask-dataframe-or-alternative-scalable-way-of-dropping-rows-of-low-frequency-it