Find value counts within a pandas dataframe of strings


I want to get the frequency count of strings within a column. On the one hand, this is similar to collapsing a dataframe to a set of rows that only reflects the strings in the …

4 Answers
  • 2020-12-21 14:03

    You can use `value_counts` via `pd.Series` with `apply` (thanks to Jon for the improvement), i.e.:

    ndf = df.apply(pd.Series.value_counts).fillna(0)
    
               2017-08-09  2017-08-10
    active_1             2         3.0
    active_1-3           1         0.0
    active_3-7           1         1.0
    pre                  1         1.0
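
    The output above implies a small input frame. A minimal sketch of that frame (a hypothetical reconstruction, since the question's sample data is cut off) and the full call:

    ```python
    import pandas as pd

    # Hypothetical reconstruction of the input implied by the output above
    df = pd.DataFrame({
        '2017-08-09': ['active_1', 'active_1', 'active_1-3', 'active_3-7', 'pre'],
        '2017-08-10': ['active_1', 'active_1', 'active_1', 'active_3-7', 'pre'],
    })

    # value_counts runs per column; strings missing from a column become
    # NaN when the per-column results are aligned, hence the fillna(0)
    ndf = df.apply(pd.Series.value_counts).fillna(0)
    print(ndf)
    ```

    The `fillna(0)` is needed because alignment across columns introduces NaN for any string that never appears in a given column (e.g. `active_1-3` on 2017-08-10).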
    

    Timings:

    k = pd.concat([df]*1000)

    %%timeit
    # @cᴏʟᴅsᴘᴇᴇᴅ's get_dummies + groupby method
    pd.get_dummies(k.T).groupby(by=lambda x: x.split('_', 1)[1], axis=1).sum().T
    1 loop, best of 3: 5.68 s per loop

    %%timeit
    # @cᴏʟᴅsᴘᴇᴇᴅ's stack + get_dummies method
    k.stack().str.get_dummies().sum(level=1).T
    10 loops, best of 3: 84.1 ms per loop

    %%timeit
    # My method
    k.apply(pd.Series.value_counts).fillna(0)
    100 loops, best of 3: 7.57 ms per loop

    %%timeit
    # FabienP's method
    k.unstack().groupby(level=0).value_counts().unstack().T.fillna(0)
    100 loops, best of 3: 7.35 ms per loop

    %%timeit
    # @Wen's method (fastest so far)
    pd.concat([pd.Series(collections.Counter(k[x])) for x in df.columns], axis=1)
    100 loops, best of 3: 4 ms per loop
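
    Outside IPython the `%%timeit` magic is unavailable; the same comparison can be sketched with the stdlib `timeit` module (the sample frame here is a hypothetical reconstruction of the one implied above):

    ```python
    import collections
    import timeit

    import pandas as pd

    # Hypothetical sample frame matching the output shown earlier
    df = pd.DataFrame({
        '2017-08-09': ['active_1', 'active_1', 'active_1-3', 'active_3-7', 'pre'],
        '2017-08-10': ['active_1', 'active_1', 'active_1', 'active_3-7', 'pre'],
    })
    k = pd.concat([df] * 1000)

    # Time two of the contenders; number=10 keeps the script quick
    n = 10
    t_apply = timeit.timeit(
        lambda: k.apply(pd.Series.value_counts).fillna(0), number=n)
    t_counter = timeit.timeit(
        lambda: pd.concat(
            [pd.Series(collections.Counter(k[x])) for x in df.columns], axis=1),
        number=n)
    print(f'apply:   {t_apply / n * 1000:.2f} ms per loop')
    print(f'Counter: {t_counter / n * 1000:.2f} ms per loop')
    ```

    Absolute numbers will differ from the figures quoted above depending on hardware and pandas version; only the relative ordering is the point.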
    
  • 2020-12-21 14:08

    I do not know why I am addicted to using apply in this strange way ...

    df.apply(lambda x : x.groupby(x).count()).fillna(0)
    Out[31]: 
                2017-08-09  2017-08-10
    active_1             2         3.0
    active_1-3           1         0.0
    active_3-7           1         1.0
    pre                  1         1.0
    

    Or

    import collections
    df.apply(lambda x : pd.Series(collections.Counter(x))).fillna(0)
    

    As I expected, a simple for loop is faster than apply:

    pd.concat([pd.Series(collections.Counter(df[x])) for x in df.columns],axis=1)
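
    One caveat with the `Counter` variant: `pd.concat` assigns positional integer column names, so they need to be restored. A runnable sketch (the sample frame is a hypothetical reconstruction of the one implied above):

    ```python
    import collections
    import pandas as pd

    # Hypothetical sample frame matching the output shown earlier
    df = pd.DataFrame({
        '2017-08-09': ['active_1', 'active_1', 'active_1-3', 'active_3-7', 'pre'],
        '2017-08-10': ['active_1', 'active_1', 'active_1', 'active_3-7', 'pre'],
    })

    # Counter tallies each string; concat aligns the per-column Series on
    # the union of observed strings, leaving NaN where a string is absent
    res = pd.concat([pd.Series(collections.Counter(df[x])) for x in df.columns],
                    axis=1).fillna(0)
    res.columns = df.columns  # concat uses positional 0..n-1 names otherwise
    ```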
    
  • 2020-12-21 14:13

    stack + get_dummies + sum:

    df.stack().str.get_dummies().sum(level=1).T
    
                2017-08-09  2017-08-10
    active_1             2           3
    active_1-3           1           0
    active_3-7           1           1
    pre                  1           1
    

    Very piR-esque if I do say so myself, elegance-wise, not speed-wise.
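
    Note that the `level` argument to `sum` was deprecated in pandas 1.x and removed in 2.0; on recent pandas the equivalent is an explicit groupby on the index level. A sketch, assuming the same hypothetical sample frame:

    ```python
    import pandas as pd

    # Hypothetical sample frame matching the output shown earlier
    df = pd.DataFrame({
        '2017-08-09': ['active_1', 'active_1', 'active_1-3', 'active_3-7', 'pre'],
        '2017-08-10': ['active_1', 'active_1', 'active_1', 'active_3-7', 'pre'],
    })

    # stack -> MultiIndex Series of strings; get_dummies -> one indicator
    # column per unique string; grouping on index level 1 (the original
    # column labels) replaces the removed sum(level=1)
    out = df.stack().str.get_dummies().groupby(level=1).sum().T
    ```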


    Alternative with pd.get_dummies + groupby:

    pd.get_dummies(df.T).groupby(by=lambda x: x.split('_', 1)[1], axis=1).sum().T
    
                2017-08-09  2017-08-10
    active_1             2           3
    active_1-3           1           0
    active_3-7           1           1
    pre                  1           1
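
    `groupby(..., axis=1)` is deprecated on recent pandas; an equivalent is to transpose the dummies first and group the rows. A sketch, assuming the same hypothetical sample frame:

    ```python
    import pandas as pd

    # Hypothetical sample frame matching the output shown earlier
    df = pd.DataFrame({
        '2017-08-09': ['active_1', 'active_1', 'active_1-3', 'active_3-7', 'pre'],
        '2017-08-10': ['active_1', 'active_1', 'active_1', 'active_3-7', 'pre'],
    })

    # get_dummies(df.T) yields columns named '<position>_<string>';
    # transposing and grouping the rows by the stripped string reproduces
    # the axis=1 groupby without the deprecated keyword
    out = pd.get_dummies(df.T).T.groupby(lambda c: c.split('_', 1)[1]).sum()
    ```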
    
  • 2020-12-21 14:17

    Another solution, using groupby and value_counts:

    df.unstack().groupby(level=0).value_counts().unstack().T.fillna(0)
    Out[]:
                2017-08-09  2017-08-10
    active_1           2.0         3.0
    active_1-3         1.0         0.0
    active_3-7         1.0         1.0
    pre                1.0         1.0
    

    Or, avoiding the final call to fillna by passing fill_value to unstack:

    df.unstack().groupby(level=0).value_counts().unstack(fill_value=0).T
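
    A self-contained sketch of this variant (the sample frame is a hypothetical reconstruction of the one implied above):

    ```python
    import pandas as pd

    # Hypothetical sample frame matching the output shown earlier
    df = pd.DataFrame({
        '2017-08-09': ['active_1', 'active_1', 'active_1-3', 'active_3-7', 'pre'],
        '2017-08-10': ['active_1', 'active_1', 'active_1', 'active_3-7', 'pre'],
    })

    # unstack -> Series indexed by (column, row); grouping on level 0
    # counts strings per original column; unstack(fill_value=0) pivots the
    # counted strings back into columns, and .T puts them on the rows
    out = df.unstack().groupby(level=0).value_counts().unstack(fill_value=0).T
    ```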
    