Vectorized way to count occurrences of string in either of two columns

一整个雨季 2021-01-05 03:57

I have a problem that is similar to this question, but just different enough that it can't be solved with the same solution...

I've got two dataframes,

4 answers
  •  轻奢々 (original poster)
    2021-01-05 04:28

    Below are a couple of approaches based on NumPy arrays, followed by benchmarks.

    Important: Take these results with a grain of salt. Remember, performance is dependent on your data, environment and hardware. In your choice, you should also consider readability / adaptability.

    Categorical data: The strong performance of jp2 comes from using categorical data, i.e. factorising strings to integers via an internal dictionary-like structure. The gain is data-dependent, but where it applies it should carry over to all of the algorithms below, with both performance and memory benefits.
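
To illustrate what the conversion to categorical does, here is a minimal sketch on hypothetical toy data (the series `s` is an assumption for illustration, not one of the benchmark frames). It shows the distinct strings being stored once, with each row holding a small integer code, which is what makes the equality test in jp2 cheap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the benchmark frames from the answer.
s = pd.Series(['jack', 'jill', 'jack', 'joe'])
cat = s.astype('category')

# Each distinct string is stored once, in sorted order.
categories = list(cat.cat.categories)

# Every row holds an integer code pointing into the categories.
codes = cat.cat.codes.tolist()

# Same comparison pattern as in jp2: compare the categorical values
# against a scalar and sum the resulting boolean mask.
matches = int(np.sum(cat.values == 'jack'))

print(categories)
print(codes)
print(matches)
```

How much this helps depends on how many distinct values the columns hold relative to the number of rows.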

    import pandas as pd
    import numpy as np
    from itertools import chain
    from collections import Counter
    
    # Tested on python 3.6.2 / pandas 0.20.3 / numpy 1.13.1
    
    %timeit original(df1, df2)   # 48.4 ms per loop
    %timeit jp1(df1, df2)        # 5.82 ms per loop
    %timeit jp2(df1, df2)        # 2.20 ms per loop
    %timeit brad(df1, df2)       # 7.83 ms per loop
    %timeit cs1(df1, df2)        # 12.5 ms per loop
    %timeit cs2(df1, df2)        # 17.4 ms per loop
    %timeit cs3(df1, df2)        # 15.7 ms per loop
    %timeit cs4(df1, df2)        # 10.7 ms per loop
    %timeit wen1(df1, df2)       # 19.7 ms per loop
    %timeit wen2(df1, df2)       # 32.8 ms per loop
    
    def original(df1, df2):
        for idx,row in df2.iterrows():
            df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])
        return df2
    
    def jp1(df1, df2):
        # iat[idx, 1] assumes 'count' is the second column of df2 (see Setup).
        for idx, item in enumerate(df2['ID']):
            df2.iat[idx, 1] = np.sum((df1.ID_a.values == item) | (df1.ID_b.values == item))
        return df2
    
    def jp2(df1, df2):
        # Note: this converts the columns of the passed-in frames in place.
        df2['ID'] = df2['ID'].astype('category')
        df1['ID_a'] = df1['ID_a'].astype('category')
        df1['ID_b'] = df1['ID_b'].astype('category')
        for idx, item in enumerate(df2['ID']):
            df2.iat[idx, 1] = np.sum((df1.ID_a.values == item) | (df1.ID_b.values == item))
        return df2
    
    def brad(df1, df2):
        names1, names2 = df1.values.T
        v2 = df2.ID.values
        mask1 = v2 == names1[:, None]
        mask2 = v2 == names2[:, None]
        df2['count'] = np.logical_or(mask1, mask2).sum(axis=0)
        return df2
    
    def cs1(df1, df2):
        # set(x) dedupes within a row, so a name in both columns counts once.
        c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
        df2['count'] = df2['ID'].map(c)
        return df2
    
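    def cs2(df1, df2):
        # Per-row value counts, then the number of rows in which each name appears.
        # groupby(level=1).count() stands in for count(level=1), removed in pandas 2.0.
        v = df1.stack().groupby(level=0).value_counts().groupby(level=1).count()
        df2['count'] = df2.ID.map(v)
        return df2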
    
    def cs3(df1, df2):
        v = pd.DataFrame({
                'i' : df1.values.reshape(-1, ), 
                'j' : df1.index.repeat(2)
            })
        c = v.loc[~v.duplicated(), 'i'].value_counts()
    
        df2['count'] = df2.ID.map(c)
        return df2
    
    def cs4(df1, df2):
        v = pd.concat(
            [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
        ).value_counts()
    
        df2['count'] = df2.ID.map(v)
        return df2
    
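    def wen1(df1, df2):
        # Transpose + groupby stands in for sum(level=0, axis=1), removed in pandas 2.0.
        dummies = pd.get_dummies(df1, prefix='', prefix_sep='')
        return dummies.T.groupby(level=0).sum().T.gt(0).sum().loc[df2.ID]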
    
    def wen2(df1, df2):
        return pd.Series(Counter(list(chain(*list(map(set,df1.values)))))).loc[df2.ID]
    

    Setup

    import pandas as pd
    import numpy as np
    
    np.random.seed(42)
    
    names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
    
    df1 = pd.DataFrame({'ID_a':np.random.choice(names, 10000), 'ID_b':np.random.choice(names, 10000)})    
    
    df2 = pd.DataFrame({'ID':names})
    
    df2['count'] = 0
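
The timings only compare speed, so it can be worth checking that a vectorized variant returns the same counts as a plain boolean mask. The sketch below does that for the concat/mask idea used in cs4. It rebuilds a smaller version of the setup frames so that it runs on its own. The helper name `count_rows_containing` and the `fillna(0)` step are assumptions added for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
df1 = pd.DataFrame({'ID_a': np.random.choice(names, 1000),
                    'ID_b': np.random.choice(names, 1000)})
df2 = pd.DataFrame({'ID': names})


def count_rows_containing(df1, ids):
    """Hypothetical helper: number of df1 rows where each id is in ID_a or ID_b."""
    # Blank out ID_b where it repeats ID_a so such a row is counted only once.
    b_unique = df1['ID_b'].mask(df1['ID_a'] == df1['ID_b'])
    counts = pd.concat([df1['ID_a'], b_unique]).value_counts()
    # Names that never occur map to NaN; report them as 0 instead.
    return ids.map(counts).fillna(0).astype(int)


df2['count'] = count_rows_containing(df1, df2['ID'])

# Cross-check against a straightforward per-name boolean mask.
expected = [int(((df1['ID_a'] == n) | (df1['ID_b'] == n)).sum()) for n in names]
assert df2['count'].tolist() == expected
```

The same cross-check can be pointed at any of the other functions by swapping the helper call.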
    
