Session generation from log file analysis with pandas

面向向阳花 2020-12-15 11:08

I'm analysing an Apache log file and I have imported it into a pandas dataframe.

'65.55.52.118 - - [30/May/2013:06:58:52 -0600] "GET /detailedAddV

2 Answers
  •  执念已碎
    2020-12-15 11:33

    Andy Hayden's answer is lovely and concise, but it gets very slow if you have a large number of users/IP addresses to group over. Here's another method that's much uglier but also much faster (roughly 35x on the 500-IP sample timed below).

    import pandas as pd
    import numpy as np
    
    sample = lambda x: np.random.choice(x, size=10000)
    df = pd.DataFrame({'ip': sample(range(500)), 
                       'time': sample([1., 1.1, 1.2, 2.7, 3.2, 3.8, 3.9])})
    max_diff = 0.5 # Max time difference
    
    def method_1(df):
        df = df.sort_values('time')
        g = df.groupby('ip')
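        # transform (not apply) keeps the original row index; in recent pandas
        # versions apply prepends the group key, which breaks this assignment
        df['session'] = g['time'].transform(
            lambda s: (s - s.shift(1) > max_diff).fillna(0).cumsum(skipna=False)
            )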
        return df['session']
    
    
    def method_2(df):
        # Sort by ip then time 
        df = df.sort_values(['ip', 'time'])
    
        # Get locations where the ip changes 
        ip_change = df.ip != df.ip.shift()
        time_or_ip_change = (df.time - df.time.shift() > max_diff) | ip_change
        df['session'] = time_or_ip_change.cumsum()
    
        # The cumsum operated over the whole series, so subtract out the first 
        # value for each IP
        df['tmp'] = 0
        df.loc[ip_change, 'tmp'] = df.loc[ip_change, 'session']
        df['tmp'] = np.maximum.accumulate(df.tmp)
        df['session'] = df.session - df.tmp
    
        # Delete the temporary column
        del df['tmp']
        return df['session']
    
    r1 = method_1(df)
    r2 = method_2(df)
    
    assert (r1.sort_index() == r2.sort_index()).all()
    
    %timeit method_1(df)
    %timeit method_2(df)
    
    400 ms ± 195 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    11.6 ms ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
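
To see what method_2 is doing, it can help to run the same idea on a frame small enough to check by hand. The following is only a sketch under a few assumptions that are not in the question: the `time` column holds real datetimes, the inactivity timeout is 30 minutes, and the helper name `sessionize` and the sample rows are made up for illustration. It also swaps the temporary column and `np.maximum.accumulate` for `where` plus `ffill`, which should give the same per-IP offset.

```python
import pandas as pd

def sessionize(df, max_diff):
    """Return a 0-based session number per row, restarting for each IP.

    Hypothetical helper for illustration; expects columns 'ip' and 'time'.
    """
    df = df.sort_values(['ip', 'time'])
    # True on the first row of each IP block
    ip_change = df['ip'] != df['ip'].shift()
    # True where the gap to the previous row exceeds the timeout
    gap = df['time'].diff() > max_diff
    # Running count of session starts over the whole frame
    global_session = (ip_change | gap).cumsum()
    # The count at the first row of each IP, carried forward over that IP
    first_per_ip = global_session.where(ip_change).ffill()
    return (global_session - first_per_ip).astype(int)

# Made-up sample data, not taken from the log in the question
log = pd.DataFrame({
    'ip': ['a', 'a', 'a', 'b', 'b'],
    'time': pd.to_datetime([
        '2013-05-30 06:58:52',
        '2013-05-30 07:05:00',
        '2013-05-30 08:00:00',
        '2013-05-30 06:00:00',
        '2013-05-30 06:10:00',
    ]),
})
log['session'] = sessionize(log, pd.Timedelta(minutes=30))
print(log)
```

For this frame, ip a gets sessions 0, 0, 1 (the 55-minute gap before 08:00 starts a new one) and ip b gets 0, 0. The large negative time difference at the a-to-b boundary never counts as a gap; the IP change alone starts the new session there.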
    
