Session generation from log file analysis with pandas

面向向阳花 2020-12-15 11:08

I'm analysing an Apache log file and I have imported it into a pandas DataFrame.

'65.55.52.118 - - [30/May/2013:06:58:52 -0600] "GET /detailedAddV
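Before sessionizing, the raw log lines have to be parsed into columns. A minimal sketch, assuming the Apache common log format; the sample lines and the column names `ip`/`time`/`request` are made up for illustration, not taken from the asker's data:

```python
import re
import pandas as pd

# Hypothetical sample lines in Apache common log format (not the asker's file)
lines = [
    '65.55.52.118 - - [30/May/2013:06:58:52 -0600] "GET /a HTTP/1.1" 200 123',
    '65.55.52.118 - - [30/May/2013:06:59:02 -0600] "GET /b HTTP/1.1" 200 456',
]

# One capture group per field of interest: client IP, timestamp, request line
pattern = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)"')

rows = [m.groups() for m in map(pattern.match, lines) if m]
df = pd.DataFrame(rows, columns=['ip', 'time', 'request'])

# Parse the Apache timestamp, including the UTC offset
df['time'] = pd.to_datetime(df['time'], format='%d/%b/%Y:%H:%M:%S %z')
print(df)
```

With `time` as a real datetime column, the gap-based session logic in the answers below applies directly (e.g. by comparing diffs against a `pd.Timedelta`).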

2 Answers
  • 2020-12-15 11:33

    Andy Hayden's answer is lovely and concise, but it gets very slow if you have a large number of users/IP addresses to group over. Here's another method that's much uglier but also much faster.

    import pandas as pd
    import numpy as np
    
    sample = lambda x: np.random.choice(x, size=10000)
    df = pd.DataFrame({'ip': sample(range(500)), 
                       'time': sample([1., 1.1, 1.2, 2.7, 3.2, 3.8, 3.9])})
    max_diff = 0.5 # Max time difference
    
    def method_1(df):
        df = df.sort_values('time')
        # group_keys=False keeps the result aligned with df's index on modern pandas
        g = df.groupby('ip', group_keys=False)
        df['session'] = g['time'].apply(
            lambda s: (s - s.shift(1) > max_diff).fillna(0).cumsum(skipna=False)
            )
        return df['session']
    
    
    def method_2(df):
        # Sort by ip then time 
        df = df.sort_values(['ip', 'time'])
    
        # Get locations where the ip changes 
        ip_change = df.ip != df.ip.shift()
        time_or_ip_change = (df.time - df.time.shift() > max_diff) | ip_change
        df['session'] = time_or_ip_change.cumsum()
    
        # The cumsum operated over the whole series, so subtract out the first 
        # value for each IP
        df['tmp'] = 0
        df.loc[ip_change, 'tmp'] = df.loc[ip_change, 'session']
        df['tmp'] = np.maximum.accumulate(df.tmp)
        df['session'] = df.session - df.tmp
    
        # Delete the temporary column
        del df['tmp']
        return df['session']
    
    r1 = method_1(df)
    r2 = method_2(df)
    
    assert (r1.sort_index() == r2.sort_index()).all()
    
    %timeit method_1(df)
    %timeit method_2(df)
    
    400 ms ± 195 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    11.6 ms ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
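    As an aside (not part of the original answer), recent pandas versions can express the same per-IP sessionization without `apply` at all, using `groupby(...).diff()` followed by a grouped cumulative sum. A small sketch with toy data:

    ```python
    import pandas as pd

    df = pd.DataFrame({'ip': ['A', 'A', 'A', 'B', 'B', 'B'],
                       'time': [1.0, 1.1, 2.7, 1.0, 3.2, 3.3]})
    max_diff = 0.5  # Max time difference within a session

    df = df.sort_values(['ip', 'time'])
    # Gap to the previous hit within each ip; the first row of a group diffs to NaN -> False
    new_session = df.groupby('ip')['time'].diff() > max_diff
    # Count session breaks cumulatively, restarting at 0 for each ip
    df['session'] = new_session.astype(int).groupby(df['ip']).cumsum()
    print(df)
    ```

    This stays fully vectorized like method_2, but avoids the manual subtract-the-first-value bookkeeping.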
    
  • 2020-12-15 11:57

    I would do this using a shift and a cumsum (here's a simple example with numbers instead of times, but it works exactly the same):

    In [11]: s = pd.Series([1., 1.1, 1.2, 2.7, 3.2, 3.8, 3.9])
    
    In [12]: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False)  # *
    Out[12]:
    0    0
    1    0
    2    0
    3    1
    4    1
    5    2
    6    2
    dtype: int64
    

    * the need for skipna=False appears to be a bug.

    Then you can use this in a groupby apply:

    In [21]: df = pd.DataFrame([[1.1, 1.7, 2.5, 2.6, 2.7, 3.4], list('AAABBB')]).T
    
    In [22]: df.columns = ['time', 'ip']
    
    In [23]: df
    Out[23]:
      time ip
    0  1.1  A
    1  1.7  A
    2  2.5  A
    3  2.6  B
    4  2.7  B
    5  3.4  B
    
    In [24]: g = df.groupby('ip', group_keys=False)  # group_keys=False for modern pandas
    
    In [25]: df['session_number'] = g['time'].apply(lambda s: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False))
    
    In [26]: df
    Out[26]:
      time ip  session_number
    0  1.1  A               0
    1  1.7  A               1
    2  2.5  A               2
    3  2.6  B               0
    4  2.7  B               0
    5  3.4  B               1
    

    Now you can groupby 'ip' and 'session_number' (and analyse each session).
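    For instance, a sketch of that per-session analysis, re-creating the small frame above together with the session numbers it computed:

    ```python
    import pandas as pd

    df = pd.DataFrame({'time': [1.1, 1.7, 2.5, 2.6, 2.7, 3.4],
                       'ip': list('AAABBB'),
                       'session_number': [0, 1, 2, 0, 0, 1]})

    # Per-session statistics: first hit, last hit, and number of requests
    sessions = df.groupby(['ip', 'session_number'])['time'].agg(['min', 'max', 'count'])
    print(sessions)
    ```

    The result is indexed by `(ip, session_number)`, so each row summarizes one session.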
