I'm analysing an Apache log file and I have imported it into a pandas dataframe.
'65.55.52.118 - - [30/May/2013:06:58:52 -0600] "GET /detailedAddV
Andy Hayden's answer is lovely and concise, but it gets very slow if you have a large number of users/IP addresses to group over. Here's another method that's much uglier but also much faster.
import pandas as pd
import numpy as np

# Draw 10,000 values uniformly at random from the given pool.
def sample(x):
    return np.random.choice(x, size=10000)

# Synthetic log: 500 possible IPs, a handful of distinct timestamps.
df = pd.DataFrame({'ip': sample(range(500)),
                   'time': sample([1., 1.1, 1.2, 2.7, 3.2, 3.8, 3.9])})

max_diff = 0.5 # Max time difference
def method_1(df, max_diff=0.5):
    """Assign a 0-based session label to each row, per IP.

    A new session starts whenever the gap to the previous event of the
    same IP exceeds ``max_diff``.

    Parameters
    ----------
    df : pandas.DataFrame
        Must have 'ip' and 'time' columns. Not modified (we work on the
        copy returned by ``sort_values``).
    max_diff : float, default 0.5
        Maximum time gap between consecutive events of one session.
        (Was a module-level global; now a parameter with the same value.)

    Returns
    -------
    pandas.Series
        Integer session labels, indexed like ``df`` (rows reordered by time).
    """
    df = df.sort_values('time')
    g = df.groupby('ip')
    # s.diff() is NaN on each group's first row, and NaN > max_diff is
    # False, so every IP's first event starts session 0 — no fillna needed.
    # transform (not apply) guarantees the result aligns with df's index.
    df['session'] = g['time'].transform(
        lambda s: (s.diff() > max_diff).cumsum()
    )
    return df['session']
def method_2(df, max_diff=0.5):
    """Vectorised equivalent of ``method_1`` — no per-group Python calls.

    Parameters
    ----------
    df : pandas.DataFrame
        Must have 'ip' and 'time' columns. Not modified (we work on the
        copy returned by ``sort_values``).
    max_diff : float, default 0.5
        Maximum time gap between consecutive events of one session.
        (Was a module-level global; now a parameter with the same value.)

    Returns
    -------
    pandas.Series
        Integer session labels, indexed like ``df`` (rows reordered).
    """
    # Sort by ip then time so each IP's events are contiguous and ordered.
    df = df.sort_values(['ip', 'time'])
    # True on the first row of each IP block.
    ip_change = df.ip != df.ip.shift()
    # A session boundary is a large time gap OR a new IP. The NaN from the
    # very first shift() compares False, but ip_change is True there anyway.
    time_or_ip_change = (df.time - df.time.shift() > max_diff) | ip_change
    df['session'] = time_or_ip_change.cumsum()
    # The cumsum ran over the whole column, so renumber each IP from zero
    # by subtracting the session id of that IP's first row.
    df['tmp'] = 0
    df.loc[ip_change, 'tmp'] = df.loc[ip_change, 'session']
    df['tmp'] = np.maximum.accumulate(df.tmp)
    df['session'] = df.session - df.tmp
    # Delete the temporary column
    del df['tmp']
    return df['session']
# Both methods must agree row for row (their indices differ in order only).
r1 = method_1(df)
r2 = method_2(df)
assert (r1.sort_index() == r2.sort_index()).all()
# IPython magics — time both implementations (output shown below).
%timeit method_1(df)
%timeit method_2(df)
400 ms ± 195 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
11.6 ms ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)