How to count the number of occurrences in either of two columns

情歌与酒 2021-01-16 18:47

I have a simple looking problem. I have a dataframe df with two columns. For each of the strings that occurs in either of these columns I would like to count th

2 Answers
  • 2021-01-16 18:51

    You can use loc to drop the 'col2' values that match 'col1' in the same row, append the remaining 'col2' values to 'col1', and then call value_counts:

    counts = df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
    

    The resulting output:

    i    4
    d    3
    h    3
    a    2
    j    1
    k    1
    c    1
    g    1
    b    1
    e    1
    

    Note: You can add .sort_index() to the end of the counting code if you want the output to appear in alphabetical order.
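
    On recent pandas releases Series.append is no longer available (it was deprecated in 1.4 and removed in 2.0), so the one-liner above raises an AttributeError there. Below is a minimal sketch of the same idea using pd.concat, assuming the ten-row sample frame from the other answer; the name extra is only illustrative:

```python
import pandas as pd

# Sample data: the ten-row frame used in the other answer.
df = pd.DataFrame({
    'col1': list('gacjdibdid'),
    'col2': list('khieihbdah'),
})

# Keep a 'col2' value only when it differs from 'col1' in the same row,
# so a string that appears twice in one row is counted once.
extra = df.loc[df['col1'] != df['col2'], 'col2']

# pd.concat stands in for the removed Series.append.
counts = pd.concat([df['col1'], extra]).value_counts()
print(counts.sort_index())
```

    The filtering step is unchanged; only the way the two Series are joined differs.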

    Timings

    Using the following setup to produce a larger sample dataset:

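    from string import ascii_lowercase

    import numpy as np
    import pandas as pd
    
    n = 10**5
    data = np.random.choice(list(ascii_lowercase), size=(n,2))
    df = pd.DataFrame(data, columns=['col1', 'col2'])
    
    def edchum(df):
        # The get_dummies approach from the other answer, wrapped in a function for timing.
        vals = np.unique(df.values)
        # reindex(columns=...) replaces reindex_axis, which was removed in pandas 1.0.
        d1 = df['col1'].str.get_dummies().reindex(columns=vals).fillna(0)
        d2 = df['col2'].str.get_dummies().reindex(columns=vals).fillna(0)
        return np.maximum(d1, d2).sum()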
    

    I get the following timings:

    %timeit df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
    10 loops, best of 3: 19.7 ms per loop
    
    %timeit edchum(df)
    1 loop, best of 3: 3.81 s per loop
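
    Timings only mean something if both approaches return the same counts. Here is a rough sanity check, assuming a smaller random frame so it runs quickly; the function names concat_counts and dummies_counts are hypothetical, and both are written with calls available on current pandas releases:

```python
from string import ascii_lowercase

import numpy as np
import pandas as pd

def concat_counts(df):
    # Count 'col1', plus 'col2' wherever it differs from 'col1' in that row.
    extra = df.loc[df['col1'] != df['col2'], 'col2']
    return pd.concat([df['col1'], extra]).value_counts()

def dummies_counts(df):
    # One indicator column per distinct string, aligned across both columns.
    vals = np.unique(df.values)
    d1 = df['col1'].str.get_dummies().reindex(columns=vals, fill_value=0)
    d2 = df['col2'].str.get_dummies().reindex(columns=vals, fill_value=0)
    # The element-wise maximum marks a row once even if both columns match.
    return np.maximum(d1, d2).sum()

rng = np.random.default_rng(0)
small = pd.DataFrame(rng.choice(list(ascii_lowercase), size=(1000, 2)),
                     columns=['col1', 'col2'])

# Compare as plain dicts so index order and dtype differences do not matter.
a = concat_counts(small).astype(int).to_dict()
b = dummies_counts(small).astype(int).to_dict()
print(a == b)
```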
    
  • 2021-01-16 19:16

    OK, this is much trickier than I thought. I'm not sure how well this will scale, but if you have a lot of repeating values then it will be more efficient than your current method. Basically, we can use str.get_dummies on each column and reindex the columns of each result so that both dummies frames cover all unique values; we can then take np.maximum of the two frames and sum the result:

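    In [77]:
    import io
    import numpy as np
    import pandas as pd

    t="""col1 col2
    g k
    a h
    c i
    j e
    d i
    i h
    b b
    d d
    i a
    d h"""
    # sep=r'\s+' splits on whitespace (delim_whitespace is deprecated in recent pandas)
    df = pd.read_csv(io.StringIO(t), sep=r'\s+')
    vals = np.unique(df.values)  # every distinct string across both columns
    # reindex(columns=...) replaces reindex_axis, which was removed in pandas 1.0
    np.maximum(df['col1'].str.get_dummies().reindex(columns=vals).fillna(0), df['col2'].str.get_dummies().reindex(columns=vals).fillna(0)).sum()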
    
    Out[77]:
    a    2
    b    1
    c    1
    d    3
    e    1
    g    1
    h    3
    i    4
    j    1
    k    1
    dtype: float64
    

    vals here is just the unique values:

    In [80]:
    vals = np.unique(df.values)
    vals
    
    Out[80]:
    array(['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j', 'k'], dtype=object)
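
    To see why the element-wise maximum is used instead of simply adding the two dummies frames, here is a small sketch on a made-up three-row frame (toy, added and per_row are illustrative names, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Row 0 has the same string in both columns.
toy = pd.DataFrame({'col1': ['b', 'a', 'h'], 'col2': ['b', 'h', 'a']})
vals = np.unique(toy.values)

# Indicator frames with identical columns; cast to int so they can be added.
d1 = toy['col1'].str.get_dummies().reindex(columns=vals, fill_value=0).astype(int)
d2 = toy['col2'].str.get_dummies().reindex(columns=vals, fill_value=0).astype(int)

added = (d1 + d2).sum()             # row ('b', 'b') contributes 2 to 'b'
per_row = np.maximum(d1, d2).sum()  # the same row contributes 1 to 'b'
print(added['b'], per_row['b'])
```

    Adding the frames counts occurrences, while the maximum counts rows in which the string appears in either column.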
    