Count consecutive ones in a dataframe and get indices where this occurs

问题

I have a pandas.DataFrame with integer column names, which has zeroes and ones. An example of the input:

    12  13  14  15
1   0   0   1   0
2   0   0   1   1
3   1   0   0   1
4   1   1   0   1
5   1   1   1   0
6   0   0   1   0
7   0   0   1   1
8   1   1   0   1
9   0   0   1   1
10  0   0   1   1
11  1   1   0   1
12  1   1   1   1
13  1   1   1   1
14  1   0   1   1
15  0   0   1   1

I need to count all consecutive ones which has a length/sum which is >=2, iterating through columns and returning also indices where an array of the consecutive ones occurs (start, end).

The preferred output would be a 3D DataFrame, where subcolumns "count" and "indices" refer to integer column names from the input.

An example output would look like this one:

12              13              14              15
count   indices count   indices count   indices count   indices
    3     (3,5)     2     (4,5)     2     (1,2)     3     (2,4)
    4   (11,14)     3   (11,13)     3     (5,7)     9    (7,15)
                                    2    (9,10) 
                                    4   (12,15)

I suppose it should be solved with itertools.groupby, but still can't figure out how to apply it to such problem, where both groupby results and its indices are being extracted.

回答1:

Here is one way to calculate the desired run lengths:

Code:

def min_run_length(series):
    terminal = pd.Series([0])
    diffs = pd.concat([terminal, series, terminal]).diff()
    starts = np.where(diffs == 1)
    ends = np.where(diffs == -1)
    return [(e-s, (s, e-1)) for s, e in zip(starts[0], ends[0])
            if e - s >= 2]

Test Code:

df = pd.read_fwf(StringIO(u"""
    12  13  14  15
    0   0   1   0
    0   0   1   1
    1   0   0   1
    1   1   0   1
    1   1   1   0
    0   0   1   0
    0   0   1   1
    1   1   0   1
    0   0   1   1
    0   0   1   1
    1   1   0   1
    1   1   1   1
    1   1   1   1
    1   0   1   1
    0   0   1   1"""), header=1)
print(df.dtypes)

indices = {cname: min_run_length(df[cname]) for cname in df.columns}
print(indices)

Results:

{
 u'12': [(3, (3, 5)), (4, (11, 14))], 
 u'13': [(2, (4, 5)), (3, (11, 13))], 
 u'14': [(2, (1, 2)), (3, (5, 7)), (2, (9, 10)), (4, (12, 15))]
 u'15': [(3, (2, 4)), (9, (7, 15))], 
}

来源：https://stackoverflow.com/questions/43986045/count-consecutive-ones-in-a-dataframe-and-get-indices-where-this-occurs

标签

python

pandas

itertools