Indexing sequence chunks using data.table

Say I have a data set where sequences of length 1 are illegal, sequences of length 2 to 5 are legal, and sequences longer than 5 are illegal, although a longer sequence may be broken up into sequences of length <= 5.

library(data.table)
set.seed(1)
DT1 <- data.table(smp = 1, R=sample(0:1, 20000, rep=TRUE), Seq = 0L)
DT1[, smp:=1:length(smp)]
DT1[, Seq:=seq(.N), by=list(cumsum(c(0, abs(diff(R)))))]

This last line comes directly from: Creating a sequence in a data.table depending on a column
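To illustrate what that grouping key does, here is a minimal sketch on a toy vector (the vector and the name toy are made up for illustration, they are not part of my data). It assumes R only holds 0s and 1s, so every change has an absolute difference of 1 and the cumulative sum goes up by one at the start of each new run:

```r
library(data.table)

# Hypothetical toy vector, not taken from the real data
R <- c(1, 1, 0, 1, 1, 1, 0, 0)

# abs(diff(R)) is 1 wherever the value changes and 0 otherwise;
# prepending 0 and taking the cumulative sum gives one id per run
run_id <- cumsum(c(0, abs(diff(R))))
print(run_id)

# Same key used as a grouping expression: Seq counts positions within each run
toy <- data.table(R = R)
toy[, Seq := seq(.N), by = list(cumsum(c(0, abs(diff(R)))))]
print(toy)
```

Each run gets its own counter that restarts at 1, which is what Seq holds in DT1.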

DT1[, fix_min:=ifelse((R==TRUE & Seq==1) | (R==FALSE), FALSE, TRUE)]
fixmin_idx2 <- which(DT1[, fix_min==TRUE])
DT1[fixmin_idx2 -1, fix_min:=TRUE]

Now my legal length-2 sequences are properly marked. Next, break up the sequences longer than 5.

DT1[R==1 & Seq==6, fix_min:=FALSE]
DT1[,Seq2:=seq(.N), by=list(cumsum(c(0, abs(diff(fix_min)))))]
DT1[R==1 & Seq2==6, fix_min:=FALSE]
fixSeq2_idx7 <- which(DT1[,fix_min==TRUE] & DT1[,Seq2==7])
fixSeq2_idx7
[1] 10203 13228
DT1[fixSeq2_idx7,]
     smp R Seq fix_min Seq2
1: 10203 1  13    TRUE    7
2: 13228 1  13    TRUE    7
DT1[fixSeq2_idx7 + 1,]
     smp R Seq fix_min Seq2
1: 10204 1  14    TRUE    8
2: 13229 0   1   FALSE    1

And now I want to test whether a Seq2==7 is followed by a Seq2==8, which would be a legal length-2 sequence. I have one 7 followed by an 8 and one not followed by an 8. And there I'm stuck. Everything I've tried either sets all fix_min to TRUE or alternates TRUE and FALSE.

Any guidance greatly appreciated.

If I understand your question correctly, you want to set fix_min to FALSE when R == 0 or when R == 1 & (1 <= Seq < 6 | Seq > 6). Then the following should give you what you want:

# recreating the data from your first code block
# (rleid() requires data.table 1.9.6 or later)
library(data.table)
set.seed(1)
DT1 <- data.table(R=sample(0:1, 20000, rep=TRUE))[, smp:=.I
                                                  ][, Seq:=seq(.N), by=rleid(R)
                                                    ][, Seq2 := Seq[.N], by=rleid(R)]

# adding the needed 'fix_min' column
DT1[, fix_min := (R==1 & Seq[.N] > 1 & Seq%%6!=0), by=rleid(R)
    ][R==1 & Seq%%6==1 & Seq2%%6==1 & Seq==Seq2, fix_min := FALSE]

Explanation:

data.table(R=sample(0:1, 20000, rep=TRUE)) creates the base of the data.table
[, smp:=.I] creates an index and adds it to the data.table
by=rleid(R) identifies the sequences; to see what it does try: data.table(R=sample(0:1, 20000, rep=TRUE))[, seq.id:=rleid(R)]
[, Seq:=seq(.N), by=rleid(R)] creates an index for each sequence and adds it to the data.table; the sequences are identified by rleid(R)
[, Seq2 := Seq[.N], by=rleid(R)] creates a variable that contains a value indicating the length of the sequence
fix_min := (R==1 & Seq[.N] > 1 & Seq%%6!=0) creates a logical vector with TRUE values where R==1 & the length of the sequence is larger than one (Seq[.N] > 1) excluding the values where the sequence number is a multiple of 6 (Seq%%6!=0)
R==1 & Seq%%6==1 & Seq2%%6==1 & Seq==Seq2 filters the data.table as follows: only rows where R==1 & the sequence value is 7, 13, 19, etc (Seq%%6==1) & the length of the sequence is 7, 13, 19, etc and only selects the last row (Seq==Seq2) from the sequences that meet the other conditions. With fix_min := FALSE you set them to FALSE.
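To see the steps above on something small enough to check by eye, here is a minimal sketch that applies the same expressions to a toy vector. The vector and the name toy are made up for illustration; it contains a run of length 1, a run of length 2 and a run of length 7:

```r
library(data.table)

# Hypothetical toy data: runs of 1s with lengths 1, 2 and 7, separated by 0s
toy <- data.table(R = c(1, 0, 1, 1, 0, rep(1, 7), 0))

# Position within each run, and the total length of the run
toy[, Seq := seq(.N), by = rleid(R)]
toy[, Seq2 := Seq[.N], by = rleid(R)]

# TRUE for 1s in runs longer than one, except every sixth position
toy[, fix_min := (R == 1 & Seq[.N] > 1 & Seq %% 6 != 0), by = rleid(R)]

# The last row of a run of length 7, 13, 19, ... would be left on its own,
# so it is switched back to FALSE
toy[R == 1 & Seq %% 6 == 1 & Seq2 %% 6 == 1 & Seq == Seq2, fix_min := FALSE]

print(toy)
```

In this toy case the single 1 stays FALSE, the pair is marked TRUE, and the run of 7 ends up as five TRUE rows followed by two FALSE rows, because the seventh element would otherwise be a sequence of length 1.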

Source: https://stackoverflow.com/questions/33416084/indexing-sequence-chunks-using-data-table

Tags

indexing

data.table

Sequence

chunks