Say I have a data set where sequences of length 1 are illegal, sequences of length 2 are legal, and sequences longer than 5 are illegal, although longer sequences may be broken up into sequences of length <= 5.
library(data.table)
set.seed(1)
DT1 <- data.table(smp = 1, R=sample(0:1, 20000, rep=TRUE), Seq = 0L)
DT1[, smp:=1:length(smp)]
DT1[, Seq:=seq(.N), by=list(cumsum(c(0, abs(diff(R)))))]
This last line comes directly from the question "Creating a sequence in a data.table depending on a column".
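To see what the grouping expression cumsum(c(0, abs(diff(R)))) does, here is a minimal sketch on a hand-made vector. The toy vector is made up for illustration, and base R's ave() stands in for the data.table by= grouping, so this is an analogue rather than the exact call used above.

```r
# Toy input (made up for illustration): runs of lengths 2, 1, 3 and 2
R <- c(1, 1, 0, 1, 1, 1, 0, 0)

# diff(R) is non-zero wherever the value changes, so the cumulative sum
# of those changes gives every run its own group id
grp <- cumsum(c(0, abs(diff(R))))

# Position within each run: the base-R analogue of Seq := seq(.N), by = grp
Seq <- ave(R, grp, FUN = seq_along)

print(grp)  # 0 0 1 2 2 2 3 3
print(Seq)  # 1 2 1 1 2 3 1 2
```

Because R only holds 0 and 1, every change has absolute difference 1, which is why the group ids increase by exactly one at each run boundary.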
DT1[, fix_min:=ifelse((R==TRUE & Seq==1) | (R==FALSE), FALSE, TRUE)]
fixmin_idx2 <- which(DT1[, fix_min==TRUE])
DT1[fixmin_idx2 -1, fix_min:=TRUE]
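The marking done in the three lines above can be traced on a toy vector. This is only a sketch in base R with made-up values; the condition R == 1 & Seq > 1 is logically the same as the ifelse() call, written more directly.

```r
# Toy data (made up): a lone 1, a run of two 1s and a run of three 1s
R   <- c(1, 0, 1, 1, 0, 1, 1, 1)
Seq <- c(1, 1, 1, 2, 1, 1, 2, 3)

# Step 1: mark every row of a run of 1s except its first row
fix_min <- R == 1 & Seq > 1

# Step 2: also mark the row just before each marked row, which picks up
# the first row of every run of length >= 2 but leaves lone 1s unmarked
idx <- which(fix_min)
fix_min[idx - 1] <- TRUE

print(fix_min)  # FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
```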
Now my legal length-2 sequences are properly marked. Next, break up the sequences longer than 5.
DT1[R==1 & Seq==6, fix_min:=FALSE]
DT1[,Seq2:=seq(.N), by=list(cumsum(c(0, abs(diff(fix_min)))))]
DT1[R==1 & Seq2==6, fix_min:=FALSE]
fixSeq2_idx7 <- which(DT1[,fix_min==TRUE] & DT1[,Seq2==7])
fixSeq2_idx7
[1] 10203 13228
DT1[fixSeq2_idx7,]
     smp R Seq fix_min Seq2
1: 10203 1  13    TRUE    7
2: 13228 1  13    TRUE    7
DT1[fixSeq2_idx7 + 1,]
     smp R Seq fix_min Seq2
1: 10204 1  14    TRUE    8
2: 13229 0   1   FALSE    1
And now I want to test whether a Seq2==7 is followed by a Seq2==8, which would be a legal sequence of length 2. I have one 7 followed by an 8 and one not followed by an 8. And there I'm stuck. Everything I've tried either sets all of fix_min to TRUE or produces alternating TRUE and FALSE.
Any guidance greatly appreciated.
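One possible way to express the "is a 7 followed by an 8" test is to compare each row with the next one using data.table's shift(). The sketch below uses made-up toy rows modelled on the two cases printed above; the column name next_Seq2 is hypothetical, and this only covers the 7/8 check, not the whole problem.

```r
library(data.table)

# Toy rows modelled on the two cases printed above (values made up)
DT <- data.table(Seq2    = c(7, 8, 1, 7, 1),
                 fix_min = c(TRUE, TRUE, FALSE, TRUE, FALSE))

# Value of Seq2 on the following row; NA on the last row
DT[, next_Seq2 := shift(Seq2, type = "lead")]

# A marked 7 that is not followed by an 8 would be a lone row, so unmark it
DT[fix_min == TRUE & Seq2 == 7 & (is.na(next_Seq2) | next_Seq2 != 8),
   fix_min := FALSE]

print(DT$fix_min)  # TRUE TRUE FALSE FALSE FALSE
```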
If I understand your question correctly, you want to set fix_min to FALSE when R == 0 or when R == 1 & (1 <= Seq < 6 | Seq > 6). Then the following should give you what you want:
# recreating the data from your first code block
set.seed(1)
DT1 <- data.table(R=sample(0:1, 20000, rep=TRUE))[, smp:=.I
][, Seq:=seq(.N), by=rleid(R)
][, Seq2 := Seq[.N], by=rleid(R)]
# adding the needed 'fix_min' column
DT1[, fix_min := (R==1 & Seq[.N] > 1 & Seq%%6!=0), by=rleid(R)
][R==1 & Seq%%6==1 & Seq2%%6==1 & Seq==Seq2, fix_min := FALSE]
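As a sanity check, the same two steps can be run on a small hand-made vector and the lengths of the resulting TRUE chunks inspected with rle(). The toy data below is made up to hit the edge cases (runs of length 1, 2, 6, 7 and 13); it is a sketch of how one might verify the result, not part of the original solution.

```r
library(data.table)

# Toy runs of 1s with lengths 1, 2, 7, 13 and 6, separated by single 0s
R  <- c(1, 0, rep(1, 2), 0, rep(1, 7), 0, rep(1, 13), 0, rep(1, 6))
DT <- data.table(R = R)
DT[, Seq  := seq(.N), by = rleid(R)]
DT[, Seq2 := Seq[.N], by = rleid(R)]

# The same two steps as in the code above
DT[, fix_min := (R == 1 & Seq[.N] > 1 & Seq %% 6 != 0), by = rleid(R)]
DT[R == 1 & Seq %% 6 == 1 & Seq2 %% 6 == 1 & Seq == Seq2, fix_min := FALSE]

# Lengths of the consecutive TRUE chunks in fix_min
chunks       <- rle(DT$fix_min)
true_lengths <- chunks$lengths[chunks$values]
print(true_lengths)  # 2 5 5 5 5
```

Every chunk length falls between 2 and 5: the lone 1 is never marked, and the 7th and 13th rows of the longer runs are unmarked because they would otherwise be left as chunks of length 1.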
Explanation:
data.table(R=sample(0:1, 20000, rep=TRUE)) creates the base of the data.table.
[, smp:=.I] creates an index and adds it to the data.table.
by=rleid(R) identifies the sequences; to see what it does, try: data.table(R=sample(0:1, 20000, rep=TRUE))[, seq.id:=rleid(R)]
[, Seq:=seq(.N), by=rleid(R)] creates an index for each sequence and adds it to the data.table; the sequences are identified by rleid(R).
[, Seq2 := Seq[.N], by=rleid(R)] creates a variable that contains the length of the sequence.
fix_min := (R==1 & Seq[.N] > 1 & Seq%%6!=0) creates a logical vector with TRUE values where R==1 and the length of the sequence is larger than one (Seq[.N] > 1), excluding the values where the sequence number is a multiple of 6 (Seq%%6!=0).
R==1 & Seq%%6==1 & Seq2%%6==1 & Seq==Seq2 filters the data.table as follows: only rows where R==1, the sequence value is 7, 13, 19, etc. (Seq%%6==1), and the length of the sequence is 7, 13, 19, etc. (Seq2%%6==1); of these, only the last row of each sequence is selected (Seq==Seq2). With fix_min := FALSE you set them to FALSE.
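The helper columns described above are easier to follow on a handful of rows. Here is a minimal sketch on a made-up vector; the column name seq.id follows the suggestion in the explanation, everything else is as in the answer.

```r
library(data.table)

# Toy input (made up for illustration): runs of lengths 2, 1, 3 and 2
DT <- data.table(R = c(1, 1, 0, 1, 1, 1, 0, 0))

DT[, seq.id := rleid(R)]                  # one id per run, starting at 1
DT[, Seq    := seq(.N), by = rleid(R)]    # position within the run
DT[, Seq2   := Seq[.N], by = rleid(R)]    # length of the run, repeated on every row

print(DT)
# seq.id: 1 1 2 3 3 3 4 4
# Seq:    1 2 1 1 2 3 1 2
# Seq2:   2 2 1 3 3 3 2 2
```

The last row of each run is exactly the row where Seq == Seq2, which is what the final filter relies on.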
Source: https://stackoverflow.com/questions/33416084/indexing-sequence-chunks-using-data-table