Pandas - very slow performance when using stack(), groupby() and apply()

Question

I am seeing very slow performance when calling stack, groupby and apply on a large DataFrame in Pandas (1498829 rows). The code computes pairwise differences: within every i1 group, it takes the difference in xx between every pair of distinct i2 values.

The part of the code that is running slow is:

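def get_diff(x):
    # i2 labels of the rows in this i1 group
    teams = x.index.get_level_values(1)
    # work on the underlying NumPy array: indexing a Series with [:, None]
    # is not supported in recent pandas versions
    vals = x.values
    tmp = pd.DataFrame(vals[:,None]-vals[None,:],
                       columns = teams.values,
                       index=teams.values).stack()
    # drop the pairs where both i2 labels are the same
    return tmp[tmp.index.get_level_values(0)!=tmp.index.get_level_values(1)]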

new_df = df.groupby('i1').xx.apply(get_diff).to_frame()
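
To make the intended result concrete, here is a minimal sketch on a toy frame. The values and the names toy, pairwise_diff, i2_a and i2_b are made up for illustration and are not part of my real code. It builds the same pairwise differences with plain NumPy broadcasting and a label mask instead of stack(); I have not checked how it scales to the full data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): two i1 groups with a few i2 entries each.
toy = pd.DataFrame(
    {"i1": [0, 0, 0, 1, 1], "i2": [0, 1, 2, 0, 1], "xx": [10, 7, 4, 5, 9]}
)

def pairwise_diff(group):
    """Return xx[a] - xx[b] for every ordered pair of rows whose i2 labels differ."""
    values = group["xx"].to_numpy()
    labels = group["i2"].to_numpy()
    diff = values[:, None] - values[None, :]    # full n x n difference matrix
    keep = labels[:, None] != labels[None, :]   # drop pairs with equal i2 labels
    rows, cols = np.nonzero(keep)               # positions of the kept pairs, row by row
    return pd.DataFrame(
        {"i2_a": labels[rows], "i2_b": labels[cols], "diff": diff[rows, cols]}
    )

# Apply the helper to each i1 group and stitch the pieces together.
parts = [pairwise_diff(g).assign(i1=key) for key, g in toy.groupby("i1")]
result = pd.concat(parts, ignore_index=True)[["i1", "i2_a", "i2_b", "diff"]]
print(result)
```

For i1 = 0 this yields the six ordered pairs among i2 = 0, 1, 2, and for i1 = 1 the two ordered pairs between i2 = 0 and 1, which is the kind of output I expect from get_diff on each group.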

And the full code (with the data) is:

import pandas as pd
from random import randint


# data (it takes some time to create [less than 1 minute on my computer])
data1   = [[[[randint(0, 100) for i in range(randint(1, 2))] for i in range(randint(1, 3))] for i in range(5000)] for i in range(100)]
data2   = pd.DataFrame(
    [
        (i1, i2, i3, i4, x4)
        for (i1, x1) in enumerate(data1)
        for (i2, x2) in enumerate(x1)
        for (i3, x3) in enumerate(x2)
        for (i4, x4) in enumerate(x3)
    ],
    columns = ['i1', 'i2', 'i3', 'i4', 'xx']
)
data2.drop(['i3', 'i4'], axis=1, inplace = True)
df   = data2.set_index(['i1', 'i2']).sort_index()


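## slow part of the code ##
def get_diff(x):
    # i2 labels of the rows in this i1 group
    teams = x.index.get_level_values(1)
    # work on the underlying NumPy array: indexing a Series with [:, None]
    # is not supported in recent pandas versions
    vals = x.values
    tmp = pd.DataFrame(vals[:,None]-vals[None,:],
                       columns = teams.values,
                       index=teams.values).stack()
    # drop the pairs where both i2 labels are the same
    return tmp[tmp.index.get_level_values(0)!=tmp.index.get_level_values(1)]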

new_df = df.groupby('i1').xx.apply(get_diff).to_frame()

I would really appreciate any suggestions for making this code more efficient so that it runs faster.

Source: https://stackoverflow.com/questions/60176200/pandas-very-slow-performance-when-using-stack-groupby-and-apply

Tags

python

pandas

pandas-groupby