I have a simple looking problem. I have a dataframe df with two columns. For each of the strings that occurs in either of these columns I would like to count the number of rows in which it occurs.
You can use loc to filter out the 'col2' values in rows where 'col2' matches 'col1' (so a string appearing in both columns of the same row is only counted once), append the remaining 'col2' values to 'col1', and then call value_counts:
counts = df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
The resulting output:
i 4
d 3
h 3
a 2
j 1
k 1
c 1
g 1
b 1
e 1
Note: You can add .sort_index() to the end of the counting code if you want the output to appear in alphabetical order.
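If you're on pandas 2.0 or later, Series.append has been removed; the same idea works with pd.concat. A minimal sketch, assuming the same df:

import pandas as pd

# keep col2 values only for rows where they differ from col1, then count everything
extra = df.loc[df['col1'] != df['col2'], 'col2']
counts = pd.concat([df['col1'], extra]).value_counts()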
Timings
Using the following setup to produce a larger sample dataset:
import numpy as np
import pandas as pd
from string import ascii_lowercase

n = 10**5
data = np.random.choice(list(ascii_lowercase), size=(n, 2))
df = pd.DataFrame(data, columns=['col1', 'col2'])
def edchum(df):
    vals = np.unique(df.values)
    count = np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0),
                       df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
    return count
I get the following timings:
%timeit df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
10 loops, best of 3: 19.7 ms per loop
%timeit edchum(df)
1 loop, best of 3: 3.81 s per loop
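If you want to reproduce the comparison outside IPython, here is a minimal sketch using the standard timeit module, assuming the setup and edchum function above and a pandas version old enough to still have Series.append and reindex_axis (op_approach is just a hypothetical wrapper name for the value_counts line above):

import timeit

def op_approach(df):
    # count each string at most once per row
    return df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()

# three repeats of each; smaller number for the slower function
print(timeit.repeat(lambda: op_approach(df), number=10, repeat=3))
print(timeit.repeat(lambda: edchum(df), number=1, repeat=3))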
OK, this is much trickier than I thought, and I'm not sure how this will scale, but if you have a lot of repeating values then it will be more efficient than your current method. Basically we can use str.get_dummies on each column and reindex the columns of each result to cover all the unique values, which gives us a dummies df per column; we can then take np.maximum of the 2 dfs and sum these:
In [77]:
import io
import numpy as np
import pandas as pd

t="""col1 col2
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h"""
df = pd.read_csv(io.StringIO(t), delim_whitespace=True)
np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0),
           df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
Out[77]:
a 2
b 1
c 1
d 3
e 1
g 1
h 3
i 4
j 1
k 1
dtype: float64
vals here is just the unique values:
In [80]:
vals = np.unique(df.values)
vals
Out[80]:
array(['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j', 'k'], dtype=object)
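On current pandas versions reindex_axis has been removed, so here is a sketch of the same idea adapted to use reindex instead (my adaptation, not verbatim from the answer above):

import numpy as np
import pandas as pd

def count_per_row(df):
    # one-hot encode each column, align both to the full set of unique values,
    # take the element-wise maximum so a value appearing in both columns of a
    # row is counted once, then sum over the rows
    vals = np.unique(df.values)
    d1 = df['col1'].str.get_dummies().reindex(columns=vals, fill_value=0)
    d2 = df['col2'].str.get_dummies().reindex(columns=vals, fill_value=0)
    return np.maximum(d1, d2).sum()

count_per_row(df).sort_index() should match the value_counts result above, just in alphabetical order.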