Question
I have a dataset with name (person_name), day and color (shirt_color) as columns.
Each person wears a shirt with a certain color on a particular day. The number of days can be arbitrary.
E.g. input:
name day color
----------------
John 1 White
John 2 White
John 3 Blue
John 4 Blue
John 5 White
Tom 2 White
Tom 3 Blue
Tom 4 Blue
Tom 5 Black
Jerry 1 Black
Jerry 2 Black
Jerry 4 Black
Jerry 5 White
I need to find the most frequently used color by each person.
E.g. result:
name color
-------------
Jerry Black
John White
Tom Blue
I am performing the following operation to get the results, which works fine but is quite slow:
most_frequent_list = [[name, group.color.mode()[0]]
                      for name, group in data.groupby('name')]
most_frequent_df = pd.DataFrame(most_frequent_list, columns=['name', 'color'])
Now suppose I have a dataset with 5 million unique names. What is the best/fastest way to perform the above operation?
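For reference, a minimal sketch reconstructing the sample dataset above together with the question's baseline approach (the variable names `data` and `most_frequent` are illustrative):

```python
import pandas as pd

# reconstruction of the sample data from the question
data = pd.DataFrame({
    'name':  ['John'] * 5 + ['Tom'] * 4 + ['Jerry'] * 4,
    'day':   [1, 2, 3, 4, 5, 2, 3, 4, 5, 1, 2, 4, 5],
    'color': ['White', 'White', 'Blue', 'Blue', 'White',
              'White', 'Blue', 'Blue', 'Black',
              'Black', 'Black', 'Black', 'White'],
})

# baseline: most frequent color per name via Series.mode
most_frequent = (data.groupby('name').color
                     .apply(lambda s: s.mode()[0])
                     .reset_index())
```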
Answer 1:
Numpy's numpy.add.at and pandas.factorize
This is intended to be fast. However, I tried to organize it to be readable as well.
i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)
b = np.zeros((n, m), dtype=np.int64)
np.add.at(b, (i, j), 1)
pd.Series(c[b.argmax(1)], r)
John White
Tom Blue
Jerry Black
dtype: object
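If the two-column frame from the question is needed rather than a Series, the name-indexed result can be converted back; a sketch with a hypothetical result Series `s`:

```python
import pandas as pd

# hypothetical result Series, shaped like the factorize approach's output
s = pd.Series(['White', 'Blue', 'Black'], index=['John', 'Tom', 'Jerry'])

# name the index and lift it (and the values) into columns
result = s.rename_axis('name').reset_index(name='color')
```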
groupby, size, and idxmax
df.groupby(['name', 'color']).size().unstack().idxmax(1)
name
Jerry Black
John White
Tom Blue
dtype: object
Counter
¯\_(ツ)_/¯
from collections import Counter
df.groupby('name').color.apply(lambda c: Counter(c).most_common(1)[0][0])
name
Jerry Black
John White
Tom Blue
Name: color, dtype: object
Answer 2:
Solution from pd.Series.mode
df.groupby('name').color.apply(pd.Series.mode).reset_index(level=1,drop=True)
Out[281]:
name
Jerry Black
John White
Tom Blue
Name: color, dtype: object
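One caveat not stated in this answer: pd.Series.mode returns every tied mode, so a name with two equally frequent colors produces two rows after the reset_index. A minimal sketch, using a hypothetical name Pat:

```python
import pandas as pd

# Pat wears Red and Blue equally often, so the mode is tied
df = pd.DataFrame({'name': ['Pat', 'Pat'], 'color': ['Red', 'Blue']})

# the same expression as in the answer above
out = df.groupby('name').color.apply(pd.Series.mode).reset_index(level=1, drop=True)

# 'Pat' now appears once per tied mode; downstream code that expects
# exactly one row per name should handle this case explicitly
```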
Answer 3:
UPDATE
It will be hard to beat this (~10× faster on the sample dataframe than any of the proposed pandas solutions, and 1.5× faster than the proposed numpy solution). The gist is to stay away from pandas and use itertools.groupby, which does a much better job with non-numerical data.
from itertools import groupby
from collections import Counter
pd.Series({x: Counter(z[-1] for z in y).most_common(1)[0][0]
           for x, y in groupby(sorted(df.values.tolist()),
                               key=lambda x: x[0])})
# Jerry Black
# John White
# Tom Blue
Old Answer
Here's another method. It is actually slower than the original one, but I'll keep it here:
data.groupby('name')['color']\
.apply(pd.Series.value_counts)\
.unstack().idxmax(axis=1)
# name
# Jerry Black
# John White
# Tom Blue
Answer 4:
How about doing two groupings with transform(max)?
df = df.groupby(["name", "color"], as_index=False, sort=False).count()
idx = df.groupby("name", sort=False).transform(max)["day"] == df["day"]
df = df[idx][["name", "color"]].reset_index(drop=True)
Output:
name color
0 John White
1 Tom Blue
2 Jerry Black
Answer 5:
Similar to @piRSquared's pd.factorize and np.add.at answer.
We encode the strings in the columns using
i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)
b = np.zeros((n, m), dtype=np.int64)
But then, instead of doing this:
np.add.at(b, (i, j), 1)
max_columns_after_add_at = b.argmax(1)
We get max_columns_after_add_at using a jitted function that does the add-at and finds the per-row maximum in the same loop:
@nb.jit(nopython=True, cache=True)
def add_at(x, rows, cols, val):
    max_vals = np.zeros((x.shape[0],), np.int64)
    max_inds = np.zeros((x.shape[0],), np.int64)
    for i in range(len(rows)):
        r = rows[i]
        c = cols[i]
        x[r, c] += 1
        if x[r, c] > max_vals[r]:
            max_vals[r] = x[r, c]
            max_inds[r] = c
    return max_inds
And then build the resulting Series at the end:
ans = pd.Series(c[max_columns_after_add_at], r)
So, the difference is how we do argmax(axis=1) after np.add.at().
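Numba aside, the loop logic can be checked against np.add.at followed by argmax in plain Python; a small sketch (the function name add_at_argmax is illustrative, and the check uses tie-free data since the running-maximum loop and argmax may break ties differently):

```python
import numpy as np

def add_at_argmax(rows, cols, n_rows, n_cols):
    # pure-Python version of the jitted loop: accumulate counts and
    # track the running per-row maximum in a single pass
    x = np.zeros((n_rows, n_cols), dtype=np.int64)
    max_vals = np.zeros(n_rows, dtype=np.int64)
    max_inds = np.zeros(n_rows, dtype=np.int64)
    for r, c in zip(rows, cols):
        x[r, c] += 1
        if x[r, c] > max_vals[r]:
            max_vals[r] = x[r, c]
            max_inds[r] = c
    return max_inds

rows = np.array([0, 0, 0, 1, 1, 1, 1])
cols = np.array([0, 1, 1, 2, 0, 2, 2])

# reference: np.add.at followed by argmax over axis 1
ref = np.zeros((2, 3), dtype=np.int64)
np.add.at(ref, (rows, cols), 1)
```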
Timing analysis
import numpy as np
import numba as nb
m = 100000
n = 100000
rows = np.random.randint(low = 0, high = m, size=10000)
cols = np.random.randint(low = 0, high = n, size=10000)
So this:
%%time
x = np.zeros((m,n))
np.add.at(x, (rows, cols), 1)
maxs = x.argmax(1)
gives:
CPU times: user 12.4 s, sys: 38 s, total: 50.4 s Wall time: 50.5 s
And this
%%time
x = np.zeros((m,n))
maxs2 = add_at(x, rows, cols, 1)
gives
CPU times: user 108 ms, sys: 39.4 s, total: 39.5 s Wall time: 38.4 s
Source: https://stackoverflow.com/questions/51975512/faster-alternative-to-perform-pandas-groupby-operation