Identifying consecutive occurrences of a value

面向向阳花 2020-12-09 11:34

I have a df like so:

Count
1
0
1
1
0
0
1
1
1
0

and I want to return a 1 in a new column if there are two or more consecutive 1s, and a 0 otherwise.

2 Answers
  • 2020-12-09 11:55

    Not sure if this is optimized, but you can give it a try:

    from itertools import groupby
    import pandas as pd
    
    l = []
    for k, g in groupby(df.Count):
        size = sum(1 for _ in g)      # length of this run of equal values
        if k == 1 and size >= 2:      # flag runs of two or more 1s
            l = l + [1]*size
        else:
            l = l + [0]*size
    
    df['new_Value'] = pd.Series(l)
    
    df
    
    Count   new_Value
    0   1   0
    1   0   0
    2   1   1
    3   1   1
    4   0   0
    5   0   0
    6   1   1
    7   1   1
    8   1   1
    9   0   0
    
  • 2020-12-09 12:06

    You could:

    df['consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
    

    to get:

       Count  consecutive
    0      1            1
    1      0            0
    2      1            2
    3      1            2
    4      0            0
    5      0            0
    6      1            3
    7      1            3
    8      1            3
    9      0            0
    
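    To see why this works: `(df.Count != df.Count.shift()).cumsum()` increments at every change-point, so each run of equal values gets its own id, and `transform('size')` broadcasts each run's length back to its rows. A minimal sketch of the intermediates, recreating the question's data:

    ```python
    import pandas as pd

    df = pd.DataFrame({'Count': [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})

    # A new run starts wherever the value differs from the previous row,
    # so the cumulative sum of those change-points labels each run.
    block = (df.Count != df.Count.shift()).cumsum()
    print(block.tolist())    # [1, 2, 3, 3, 4, 4, 5, 5, 5, 6]

    # Each row receives the length of the run it belongs to.
    run_len = df.Count.groupby(block).transform('size')
    print(run_len.tolist())  # [1, 1, 2, 2, 2, 2, 3, 3, 3, 1]
    ```

    Multiplying `run_len` by `df.Count` then zeroes out the runs of 0s, which is exactly the `consecutive` column shown above.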

From here you can apply any threshold:

    threshold = 2
    df['consecutive'] = (df.consecutive >= threshold).astype(int)
    

    to get:

       Count  consecutive
    0      1            0
    1      0            0
    2      1            1
    3      1            1
    4      0            0
    5      0            0
    6      1            1
    7      1            1
    8      1            1
    9      0            0
    

    or, in a single step:

    (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
    
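    For reuse, the single-step version can be wrapped in a small helper; `flag_runs` is a hypothetical name, not from the answer:

    ```python
    import pandas as pd

    def flag_runs(s: pd.Series, threshold: int = 2) -> pd.Series:
        """Return 1 where s sits in a run of >= threshold consecutive 1s."""
        block = (s != s.shift()).cumsum()             # id per run of equal values
        run_len = s.groupby(block).transform('size')  # length of each run
        return ((run_len * s) >= threshold).astype(int)

    df = pd.DataFrame({'Count': [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})
    print(flag_runs(df.Count).tolist())  # [0, 0, 1, 1, 0, 0, 1, 1, 1, 0]
    ```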

    In terms of efficiency, using pandas methods provides a significant speedup when the size of the problem grows:

    df = pd.concat([df for _ in range(1000)])
    
    %timeit (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
    1000 loops, best of 3: 1.47 ms per loop
    

    compared to:

    %%timeit
    l = []
    for k, g in groupby(df.Count):
        size = sum(1 for _ in g)
        if k == 1 and size >= 2:
            l = l + [1]*size
        else:
            l = l + [0]*size    
    pd.Series(l)
    
    10 loops, best of 3: 76.7 ms per loop
    