Vectorized way to count occurrences of string in either of two columns

一整个雨季 2021-01-05 03:57

I have a problem that is similar to this question, but just different enough that it can't be solved with the same solution...

I've got two dataframes,

4 answers
  •  温柔的废话
    2021-01-05 04:04

    The "either" part complicates things, but should still be doable.
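
To make the "either" requirement concrete, here is a small sketch. The frame below is hypothetical sample data (the question's real frames are not reproduced here); only the column names `ID_a` and `ID_b` are taken from the options that follow. It shows how a plain count over all cells double-counts a row where both columns hold the same name, while a per-row count does not.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration only.
df1 = pd.DataFrame({
    "ID_a": ["jack", "jill", "jane", "jack"],
    "ID_b": ["jack", "jane", "jill", "joe"],
})

# Naive count over every cell: row 0 contributes "jack" twice.
naive = pd.concat([df1["ID_a"], df1["ID_b"]]).value_counts()

# Per-row count: each row contributes each distinct name at most once.
per_row = pd.Series(
    [name for row in df1.to_numpy().tolist() for name in set(row)]
).value_counts()

print(naive["jack"], per_row["jack"])
```

All of the options below are different ways of computing the `per_row` style of count.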


    Option 1
    Since other users decided to turn this into a speed-race, here's mine:

    from collections import Counter
    from itertools import chain
    
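    # set(x) collapses a row where both columns hold the same ID
    c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
    # c is already a Counter, so it can be passed to map directly
    df2['count'] = df2['ID'].map(c)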
    df2
    
             ID  count
    0      jack      3
    1      jill      5
    2      jane      8
    3       joe      9
    4       ben      7
    5  beatrice      6
    

    176 µs ± 7.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
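
One side effect of mapping with a `Counter` that may be worth knowing: `Series.map` honors a dict subclass's `__missing__`, so IDs that never appear in `df1` come back as 0, whereas mapping with a Series (as the later options do) gives NaN for them. A minimal sketch, again on hypothetical sample data:

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical sample data; "ben" deliberately never appears in df1.
df1 = pd.DataFrame({
    "ID_a": ["jack", "jill", "jane", "jack"],
    "ID_b": ["jack", "jane", "jill", "joe"],
})
df2 = pd.DataFrame({"ID": ["jack", "jill", "ben"]})

c = Counter(chain.from_iterable(set(row) for row in df1.to_numpy().tolist()))

# Counter defines __missing__, so unseen IDs map to 0.
counts_from_counter = df2["ID"].map(c)

# A Series has no default, so unseen IDs map to NaN.
counts_from_series = df2["ID"].map(pd.Series(c))

# If NaN is not wanted, fill and cast back to integers.
counts_filled = counts_from_series.fillna(0).astype(int)
```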
    

    Option 2
    (Original answer) stack-based

    # Series.count(level=1) was removed in pandas 2.0; group by the level instead
    c = df1.stack().groupby(level=0).value_counts().groupby(level=1).count()
    

    Or,

    c = df1.stack().reset_index(level=0).drop_duplicates()[0].value_counts()
    

    Or,

    v = df1.stack()
    # groupby(level=1).count() replaces the removed .count(level=1)
    c = v.groupby([v.index.get_level_values(0), v]).count().groupby(level=1).count()
    # c = v.groupby([v.index.get_level_values(0), v]).nunique().groupby(level=1).count()
    

    And,

    df2['count'] = df2.ID.map(c)
    df2
    
             ID  count
    0      jack      3
    1      jill      5
    2      jane      8
    3       joe      9
    4       ben      7
    5  beatrice      6
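
For anyone on a recent pandas: the `level=` argument of `Series.count` was deprecated in 1.3 and removed in 2.0, so the stack idea has to go through `groupby(level=...)`. A sketch of the first variant written that way, on hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data, made up for illustration only.
df1 = pd.DataFrame({
    "ID_a": ["jack", "jill", "jane", "jack"],
    "ID_b": ["jack", "jane", "jill", "joe"],
})
df2 = pd.DataFrame({"ID": ["jack", "jill", "jane", "joe"]})

# stack() yields a Series indexed by (row label, column name).
stacked = df1.stack()

# One entry per (row label, name) pair, whatever the in-row multiplicity.
per_row_names = stacked.groupby(level=0).value_counts()

# Counting entries per name gives the number of rows containing that name.
c = per_row_names.groupby(level=1).count()

df2["count"] = df2["ID"].map(c)
```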
    

    Option 3
    repeat-based reshape and counting

    v = pd.DataFrame({
            'i' : df1.values.reshape(-1, ), 
            'j' : df1.index.repeat(2)
        })
    c = v.loc[~v.duplicated(), 'i'].value_counts()
    
    df2['count'] = df2.ID.map(c)
    df2
    
             ID  count
    0      jack      3
    1      jill      5
    2      jane      8
    3       joe      9
    4       ben      7
    5  beatrice      6
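
The `repeat(2)` above hard-codes the two-column case. If more columns are involved, the same idea should carry over by repeating each row label once per column. A sketch under that assumption; the third column `ID_c` and the data are hypothetical:

```python
import pandas as pd

# Hypothetical three-column data, made up for illustration only.
df1 = pd.DataFrame({
    "ID_a": ["jack", "jill", "jane"],
    "ID_b": ["jack", "jane", "joe"],
    "ID_c": ["joe", "jill", "jane"],
})

n_cols = df1.shape[1]

# Flatten row by row and pair every value with the label of its row.
v = pd.DataFrame({
    "i": df1.to_numpy().reshape(-1),
    "j": df1.index.repeat(n_cols),
})

# Keep the first occurrence of each (name, row) pair, then count names.
c = v.loc[~v.duplicated(), "i"].value_counts()
```

This relies on `reshape(-1)` flattening in row-major order, which lines up with `index.repeat(n_cols)`.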
    

    Option 4
    concat + mask

    v = pd.concat(
        [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
    ).value_counts()
    
    df2['count'] = df2.ID.map(v)
    df2
    
             ID  count
    0      jack      3
    1      jill      5
    2      jane      8
    3       joe      9
    4       ben      7
    5  beatrice      6
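
Why this works: `mask` replaces `ID_b` with NaN wherever it equals `ID_a`, and `value_counts` drops NaN by default, so a row with the same name in both columns is counted once. A small sketch on hypothetical sample data; note that this option only covers the two-column case:

```python
import pandas as pd

# Hypothetical sample data, made up for illustration only.
df1 = pd.DataFrame({
    "ID_a": ["jack", "jill", "jane", "jack"],
    "ID_b": ["jack", "jane", "jill", "joe"],
})

# Row 0 has "jack" in both columns, so its ID_b entry becomes NaN.
masked_b = df1["ID_b"].mask(df1["ID_a"] == df1["ID_b"])

# value_counts() skips NaN (dropna=True is the default).
v = pd.concat([df1["ID_a"], masked_b], axis=0).value_counts()
```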
    
