Arithmetic Progression series in R

耗尽温柔 提交于 2019-12-07 02:03:27

If you are looking for sequences of consecutive numbers, then cgwtools::seqle will find them for you in the same way rle finds a sequence of repeated values.

In the general case of basically any subset of your data which form such a sequence, such as the 8,10,12,14 case you cite, your criteria are so general as to be very difficult to satisfy. You'd have to start at each element of your series and do a forward-looking search for x[j] +1, x[j]+2, x[j]+3 ... ad infinitum. This suggests using some tree-based algorithms.

Here's a potential solution - albeit a very ugly, sloppy one:

##
arithSeq <- function(x=nSeq, minSize=4){
  ##
  dx <- diff(x,lag=1)
  Runs <- rle(diff(x))
  ##
  rLens <- Runs[[1]]
  rVals <- Runs[[2]]
  pStart <- c(
    rep(1,rLens[1]),
    rep(cumsum(1+rLens[-length(rLens)]),times=rLens[-1])
  )
  pEnd <- pStart + c(
    rep(rLens[1]-1, rLens[1]),
    rep(rLens[-1],times=rLens[-1])
  )
  pGrp <- rep(1:length(rLens),times=rLens)
  pLen <- rep(rLens, times=rLens)
  dAll <- data.frame(
    pStart=pStart,
    pEnd=pEnd,
    pGrp=pGrp,
    pLen=pLen,
    runVal=rep(rVals,rLens)
  )
  ##
  dSub <- subset(dAll, pLen >= minSize - 1)
  ##
  uVals <- unique(dSub$runVal)
  ##
  maxSub <- subset(dSub, runVal==uVals[1])
  maxLen <- max(maxSub$pLen)
  maxSub <- subset(maxSub, pLen==maxLen)
  ##
  if(length(uVals) > 1){
    for(i in 2:length(uVals)){
      iSub <- subset(dSub, runVal==uVals[i])
      iMaxLen <- max(iSub$pLen)
      iSub <- subset(iSub, pLen==iMaxLen)
      maxSub <- rbind(
        maxSub,
        iSub)
      maxSub
    }
    ##
  }
  ##
  deDup <- maxSub[!duplicated(maxSub),]
  seqStarts <- as.numeric(rownames(deDup))
  outList <- list(NULL); length(outList) <- nrow(deDup)
  for(i in 1:nrow(deDup)){
    outList[[i]] <- list(
      Sequence = x[seqStarts[i]:(seqStarts[i]+deDup[i,"pLen"])],
      Length=deDup[i,"pLen"]+1,
      StartPosition=seqStarts[i],
      EndPosition=seqStarts[i]+deDup[i,"pLen"])
    outList
  }
  ##
  return(outList)
  ##
}
##

So there are things that can definitely be improved in this function - for instance I made a mistake somewhere in the calculation of pStart and pEnd, the start and end indices of a given arithmetic sequence, but it just so happened that the true start positions of such sequences are given as the rownumbers of one of the intermediate data.frames, so that was a hacky sort of solution. Anyways, it accepts a numeric vector x and a minimum length parameter, minSize. It will return a list containing information about sequences meeting the criteria you outlined above.

set.seed(1234)
lSeq <- sample(1:25,100000,replace=TRUE)
nSeq <- c(1:10,12,33,13:17,16:26)
##
> arithSeq(nSeq)
[[1]]
[[1]]$Sequence
 [1] 16 17 18 19 20 21 22 23 24 25 26

[[1]]$Length
[1] 11

[[1]]$StartPosition
[1] 18

[[1]]$EndPosition
[1] 28
##
> arithSeq(x=lSeq,minSize=5)
[[1]]
[[1]]$Sequence
[1] 13 16 19 22 25

[[1]]$Length
[1] 5

[[1]]$StartPosition
[1] 12760

[[1]]$EndPosition
[1] 12764


[[2]]
[[2]]$Sequence
[1] 11 13 15 17 19

[[2]]$Length
[1] 5

[[2]]$StartPosition
[1] 37988

[[2]]$EndPosition
[1] 37992

Like I said, its sloppy and inelegant, but it should get you started.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!