Fastest way to compare row and previous row in pandas dataframe with millions of rows

折月煮酒 提交于 2019-12-02 19:19:49

I was thinking along the same lines as Andy, just with groupby added, and I think this is complementary to Andy's answer. Adding groupby is just going to have the effect of putting a NaN in the first row whenever you do a diff or shift. (Note that this is not an attempt at an exact answer, just to sketch out some basic techniques.)

df['time_diff'] = df.groupby('User')['Time'].diff()

df['Col1_0'] = df['Col1'].apply( lambda x: x[0] )

df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()

   User  Time                 Col1  time_diff Col1_0 Col1_0_prev
0     1     6     [cat, dog, goat]        NaN    cat         NaN
1     1     6         [cat, sheep]          0    cat         cat
2     1    12        [sheep, goat]          6  sheep         cat
3     2     3          [cat, lion]        NaN    cat         NaN
4     2     5  [fish, goat, lemur]          2   fish         cat
5     3     9           [cat, dog]        NaN    cat         NaN
6     4     4          [dog, goat]        NaN    dog         NaN
7     4    11                [cat]          7    cat         dog

As a followup to Andy's point about storing objects, note that what I did here was to extract the first element of the list column (and add a shifted version also). Doing it like this you only have to do an expensive extraction once and after that can stick to standard pandas methods.

Use pandas (constructs) and vectorize your code i.e. don't use for loops, instead use pandas/numpy functions.

'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1.

Calculate these separately:

df['newcol1'] = df['User'].shift() == df['User']
df.ix[0, 'newcol1'] = True # possibly tweak the first row??

df['newcol1'] = (df['Time'].shift() - df['Time']).abs() > 1

It's unclear to me the purpose of Col1, but general python objects in columns doesn't scale well (you can't use fast path and the contents are scattered in memory). Most of the time you can get away with using something else...


Cython is the very last option, and not needed in 99% of use-cases, but see enhancing performance section of the docs for tips.

In your problem, it seems like you want to iterate through row pairwise. The first thing you could do is something like this:

from itertools import tee, izip
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

for (idx1, row1), (idx2, row2) in pairwise(df.iterrows()):
    # you stuff

However you cannot modify row1 and row2 directly you will still need to use .loc or .iloc with the indexes.

If iterrows is still too slow I suggest to do something like this:

  • Create a user_id column from you unicode names using pd.unique(User) and mapping the name with a dictionary to integer ids.

  • Create a delta dataframe: to a shifted dataframe with the user_id and time column you substract the original dataframe.

    df[[col1, ..]].shift() - df[[col1, ..]])
    

If user_id > 0, it means that the user changed in two consecutive row. The time column can be filtered directly with delta[delta['time' > 1]] With this delta dataframe you record the changes row-wise. You can use it a a mask to update the columns you need from you original dataframe.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!