Vectorized way to count occurrences of string in either of two columns

后端未结

关注

 4  697

一整个雨季 2021-01-05 03:57

I have a problem that is similar to this question, but just different enough that it can\'t be solved with the same solution...

I\'ve got two dataframes,

4条回答

温柔的废话 (楼主)

2021-01-05 04:04

The "either" part complicates things, but should still be doable.

Option 1
Since other users decided to turn this into a speed-race, here's mine:

from collections import Counter
from itertools import chain

c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
df2['count'] = df2['ID'].map(Counter(c))
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

176 µs ± 7.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Option 2
(Original answer) stack based

c = df1.stack().groupby(level=0).value_counts().count(level=1)

Or,

c = df1.stack().reset_index(level=0).drop_duplicates()[0].value_counts()

Or,

v = df1.stack()
c = v.groupby([v.index.get_level_values(0), v]).count().count(level=1)
# c = v.groupby([v.index.get_level_values(0), v]).nunique().count(level=1)

And,

df2['count'] = df2.ID.map(c)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

Option 3
repeat-based Reshape and counting

v = pd.DataFrame({
        'i' : df1.values.reshape(-1, ), 
        'j' : df1.index.repeat(2)
    })
c = v.loc[~v.duplicated(), 'i'].value_counts()

df2['count'] = df2.ID.map(c)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

Option 4
concat + mask

v = pd.concat(
    [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
).value_counts()

df2['count'] = df2.ID.map(v)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

0 讨论(0)

查看其它4个回答