I want to get the frequency count of strings within a column. On one hand, this is similar to collapsing a dataframe to a set of rows that only reflects the strings in the
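For reference, the question doesn't include the example data, but a minimal DataFrame consistent with the outputs in the answers below would be (a hypothetical reconstruction; the actual values are assumptions inferred from the frequency tables):

import pandas as pd

# Hypothetical reconstruction of the example data, inferred from the
# frequency tables shown in the answers below.
df = pd.DataFrame({
    '2017-08-09': ['pre', 'active_1', 'active_1', 'active_1-3', 'active_3-7'],
    '2017-08-10': ['pre', 'active_1', 'active_1', 'active_1', 'active_3-7'],
})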
You can use value_counts and pd.Series (thanks for the improvement, Jon), i.e.
ndf = df.apply(pd.Series.value_counts).fillna(0)
            2017-08-09  2017-08-10
active_1             2         3.0
active_1-3           1         0.0
active_3-7           1         1.0
pre                  1         1.0
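The fillna(0) is needed because value_counts runs per column and drops strings with zero occurrences; when apply aligns the per-column results on their index, a string missing from one column becomes NaN. A quick sketch, assuming the reconstructed df above:

# value_counts on a single column drops zero-count strings, so
# 'active_1-3' is absent here and becomes NaN after alignment:
df['2017-08-10'].value_counts()
# active_1      3
# active_3-7    1
# pre           1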
Timings:
k = pd.concat([df]*1000)
%%timeit
# @cᴏʟᴅsᴘᴇᴇᴅ's method
pd.get_dummies(k.T).groupby(by=lambda x: x.split('_', 1)[1], axis=1).sum().T
1 loop, best of 3: 5.68 s per loop
%%timeit
# @cᴏʟᴅsᴘᴇᴇᴅ's method
k.stack().str.get_dummies().sum(level=1).T
10 loops, best of 3: 84.1 ms per loop
%%timeit
# My method
k.apply(pd.Series.value_counts).fillna(0)
100 loops, best of 3: 7.57 ms per loop
%%timeit
# FabienP's method
k.unstack().groupby(level=0).value_counts().unstack().T.fillna(0)
100 loops, best of 3: 7.35 ms per loop
%%timeit
# @Wen's method (fastest for now)
pd.concat([pd.Series(collections.Counter(k[x])) for x in df.columns], axis=1)
100 loops, best of 3: 4 ms per loop
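To reproduce these numbers outside IPython (the %%timeit cells above assume a notebook), a rough equivalent using the stdlib timeit module, under the same k:

import timeit
import collections
import pandas as pd

k = pd.concat([df] * 1000)

# Time one candidate; swap in any of the expressions above.
print(timeit.timeit(
    lambda: k.apply(pd.Series.value_counts).fillna(0),
    number=100,
))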
I do not know why I'm addicted to using apply in this strange way ...
df.apply(lambda x : x.groupby(x).count()).fillna(0)
Out[31]:
            2017-08-09  2017-08-10
active_1             2         3.0
active_1-3           1         0.0
active_3-7           1         1.0
pre                  1         1.0
Or
import collections
df.apply(lambda x : pd.Series(collections.Counter(x))).fillna(0)
As I expected, a simple for loop is faster than apply:
pd.concat([pd.Series(collections.Counter(df[x])) for x in df.columns],axis=1)
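The groupby(x) trick works because grouping a Series by its own values creates one group per distinct string, so count() yields a frequency table. A sketch of the equivalences, assuming the reconstructed df above:

import collections
import pandas as pd

s = df['2017-08-09']
# All three produce the same frequencies (ordering and dtype aside):
print(s.groupby(s).count())
print(s.value_counts())
print(pd.Series(collections.Counter(s)))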
stack + get_dummies + sum:
df.stack().str.get_dummies().sum(level=1).T
            2017-08-09  2017-08-10
active_1             2           3
active_1-3           1           0
active_3-7           1           1
pre                  1           1
Very piR-esque, if I do say so myself (elegance-wise, not speed-wise).
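Step by step: stack flattens df into a Series keyed by (row, column), str.get_dummies one-hot encodes each string into 0/1 indicator columns, and summing over index level 1 collapses back to one row per original column. Note that sum(level=...) is deprecated in recent pandas; a groupby spelling of the same chain:

stacked = df.stack()                    # MultiIndex Series: (row, date) -> string
dummies = stacked.str.get_dummies()     # one 0/1 column per distinct string
out = dummies.groupby(level=1).sum().T  # same as .sum(level=1).T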
Alternative with pd.get_dummies + groupby:
pd.get_dummies(df.T).groupby(by=lambda x: x.split('_', 1)[1], axis=1).sum().T
            2017-08-09  2017-08-10
active_1             2           3
active_1-3           1           0
active_3-7           1           1
pre                  1           1
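What happens here: pd.get_dummies names each indicator column '<original column>_<value>', and after transposing, the original columns are the row positions, so the dummy columns look like '0_pre' or '1_active_1'; the split('_', 1)[1] strips that positional prefix before grouping. Also, groupby(..., axis=1) is deprecated in recent pandas; grouping the transposed dummies by index is an equivalent spelling:

dummies = pd.get_dummies(df.T)  # columns like '0_pre', '1_active_1', ...
out = dummies.T.groupby(lambda ix: ix.split('_', 1)[1]).sum()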
Another solution using groupby and value_counts:
df.unstack().groupby(level=0).value_counts().unstack().T.fillna(0)
Out[]:
            2017-08-09  2017-08-10
active_1           2.0         3.0
active_1-3         1.0         0.0
active_3-7         1.0         1.0
pre                1.0         1.0
Or, avoiding the last call to fillna:
df.unstack().groupby(level=0).value_counts().unstack(fill_value=0).T
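To unpack this chain: df.unstack() gives a Series keyed by (column, row); groupby(level=0).value_counts() then counts strings within each column, producing a Series keyed by (column, string); the final unstack pivots the strings out to columns, and fill_value=0 fills the combinations that fillna(0) would otherwise have to patch. A sketch of the intermediate:

counts = df.unstack().groupby(level=0).value_counts()
# counts is a Series with a (column, string) MultiIndex, e.g.
# 2017-08-09  active_1    2
#             pre         1
# ...
result = counts.unstack(fill_value=0).T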