How to count the number of occurrences in either of two columns

I have a simple looking problem. I have a dataframe df with two columns. For each of the strings that occurs in either of these columns I would like to count th

2 Answers

后悔当初

2021-01-16 18:51

You can use loc to drop the 'col2' values that match 'col1' in the same row (so a string that appears in both columns of one row is counted only once), append the remaining 'col2' values to 'col1', and then call value_counts:

counts = df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()

The resulting output:

Note: You can add .sort_index() to the end of the counting code if you want the output to appear in alphabetical order.
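Series.append was deprecated in pandas 1.4 and removed in 2.0, so on a current install the one-liner above raises AttributeError. A minimal sketch of the same idea using pd.concat follows. The helper name count_either and the three-row sample frame are illustrative assumptions, not part of the original answer.

```python
import pandas as pd


def count_either(df, a="col1", b="col2"):
    """Count rows in which each string appears in column a or column b."""
    # Keep only the b values that differ from a in the same row,
    # so a row like ('b', 'b') contributes a single count for 'b'.
    only_b = df.loc[df[a] != df[b], b]
    # pd.concat stands in for the removed Series.append.
    return pd.concat([df[a], only_b]).value_counts().sort_index()


# Hypothetical sample data for illustration only.
sample = pd.DataFrame({"col1": ["a", "b", "b"], "col2": ["b", "b", "c"]})
counts = count_either(sample)
print(counts)  # a -> 1, b -> 3, c -> 1
```

The .sort_index() call is only there to make the output order predictable; drop it if you prefer the default ordering by frequency.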

Timings

Using the following setup to produce a larger sample dataset:

from string import ascii_lowercase

import numpy as np
import pandas as pd

n = 10**5
data = np.random.choice(list(ascii_lowercase), size=(n,2))
df = pd.DataFrame(data, columns=['col1', 'col2'])

def edchum(df):
    vals = np.unique(df.values)
    # reindex(columns=...) replaces reindex_axis, which was removed in pandas 1.0
    count = np.maximum(df['col1'].str.get_dummies().reindex(columns=vals).fillna(0), df['col2'].str.get_dummies().reindex(columns=vals).fillna(0)).sum()
    return count

I get the following timings:

%timeit df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
10 loops, best of 3: 19.7 ms per loop

%timeit edchum(df)
1 loop, best of 3: 3.81 s per loop
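Before trusting a timing comparison, it is worth confirming that the two approaches return the same counts. The following is a rough sketch of such a check, assuming a recent pandas; it uses pd.concat and reindex(columns=...) in place of the removed Series.append and reindex_axis, and the function names concat_counts and dummies_counts are made up for this example.

```python
from string import ascii_lowercase

import numpy as np
import pandas as pd


def concat_counts(df):
    # Filter-and-concatenate approach; pd.concat stands in for Series.append.
    only_col2 = df.loc[df["col1"] != df["col2"], "col2"]
    return pd.concat([df["col1"], only_col2]).value_counts().sort_index()


def dummies_counts(df):
    # get_dummies approach; reindex(columns=...) stands in for reindex_axis.
    vals = np.unique(df.values)
    d1 = df["col1"].str.get_dummies().reindex(columns=vals).fillna(0)
    d2 = df["col2"].str.get_dummies().reindex(columns=vals).fillna(0)
    # Element-wise maximum marks a row once even if both columns match.
    return np.maximum(d1, d2).sum().astype(int).sort_index()


# Smaller, seeded dataset than the timing run, so the check is quick and repeatable.
rng = np.random.default_rng(0)
check_df = pd.DataFrame(
    rng.choice(list(ascii_lowercase), size=(1000, 2)), columns=["col1", "col2"]
)

same = concat_counts(check_df).to_dict() == dummies_counts(check_df).to_dict()
print("results agree:", same)
```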

名媛妹妹

2021-01-16 19:16

OK, this is much trickier than I thought. I'm not sure how this will scale, but if you have a lot of repeating values it should be more efficient than your current method. The idea is to call str.get_dummies on each column and reindex the resulting columns against the full set of unique values, which gives two dummy DataFrames with identical columns. Taking np.maximum of the two marks each row in which a value appears in either column (once, even if it appears in both), and summing the result gives the counts:

In [77]:
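import io
import numpy as np
import pandas as pd

t="""col1 col2
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h"""
# sep=r"\s+" replaces delim_whitespace=True, which is deprecated in recent pandas
df = pd.read_csv(io.StringIO(t), sep=r"\s+")
# vals must exist before the next line runs; it is inspected in In [80] below
vals = np.unique(df.values)
# reindex(columns=...) replaces reindex_axis, which was removed in pandas 1.0
np.maximum(df['col1'].str.get_dummies().reindex(columns=vals).fillna(0), df['col2'].str.get_dummies().reindex(columns=vals).fillna(0)).sum()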

Out[77]:
a    2
b    1
c    1
d    3
e    1
g    1
h    3
i    4
j    1
k    1
dtype: float64

vals here is just the unique values:

In [80]:
vals = np.unique(df.values)
vals

Out[80]:
array(['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j', 'k'], dtype=object)
