Arithmetic Progression series in R

Question

I am new to this forum. I guess something like this has been asked before, but I am not really sure whether it is what I want.

I have a sequence like this,

1 2 3 4 5 8 9 10 12 14 15 17 18 19

What I wish to do is get all the numbers which form a series, i.e. the numbers belonging to that set should all have a constant difference from the previous element, and the set should contain at least 3 elements.

For example, I can see that (1,2,3,4,5) forms one such series, in which the numbers appear at an interval of 1 and the total size of the set is 5, which satisfies the minimum threshold. (1,3,5) forms another such pattern, in which the numbers appear at an interval of 2.

(8,10,12,14) forms another such pattern with an interval of 2. So, as you can see, the interval of repetition can be anything.

Also, for a particular set, I want only its maximal version. I don't want (8,10,12) as output (although it satisfies the minimum threshold of 3 and the constant difference); I want only the one of maximal length, i.e. (8,10,12,14).

Similarly, for (1,2,3,4,5), I don't want (1,2,3) or (2,3,4,5) as output, only the one of maximal length, i.e. (1,2,3,4,5).

How can I do this in R?

Edit: That is, I want any set which forms a basic AP series with any difference; however, the total number of elements in that series should be greater than 3, and the series should be maximal.

Edit2: I have tried using rle and acf in R, but that doesn't entirely solve my problem.

Edit3: When I used acf, it basically gave me the maximum peak difference that I could have used. However, I want all the possible differences. Also, rle does something quite different: it gives the longest continuous run of identical numbers, which does not occur in my case.

Answer 1:

If you are looking for sequences of consecutive numbers, then cgwtools::seqle will find them for you in the same way rle finds a sequence of repeated values.
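To make the consecutive-number case concrete, here is a minimal base R sketch of the same idea. It is not cgwtools::seqle itself, only an illustration under the assumption that the input is sorted and integer-valued; the names run_id, runs and long_runs are made up for this example.

```r
# The sequence from the question.
x <- c(1, 2, 3, 4, 5, 8, 9, 10, 12, 14, 15, 17, 18, 19)

# A new run starts wherever the step from the previous element is not 1.
run_id <- cumsum(c(TRUE, diff(x) != 1))
runs <- unname(split(x, run_id))

# Keep only the runs that reach the minimum length of 3.
long_runs <- runs[lengths(runs) >= 3]
print(long_runs)
```

For the question's data this keeps (1,2,3,4,5), (8,9,10) and (17,18,19). It only handles a step of 1 between adjacent elements, so it does not find patterns such as (8,10,12,14).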

In the general case, where any subset of your data may form such a sequence (such as the 8,10,12,14 case you cite), your criteria are so general as to be very difficult to satisfy. You'd have to start at each element of your series and do a forward-looking search for x[j]+1, x[j]+2, x[j]+3 ... ad infinitum. This suggests using some tree-based algorithm.
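The forward-looking search described above can be sketched as a plain brute-force loop rather than a tree. This is only one possible reading of the idea, assuming integer-valued input (the %in% test is not safe for arbitrary floating-point values); the function name maximal_aps and its min_size argument are hypothetical.

```r
maximal_aps <- function(x, min_size = 3) {
  x <- sort(unique(x))
  n <- length(x)
  out <- list()
  if (n < 2) return(out)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d <- x[j] - x[i]              # candidate common difference (always > 0)
      # If x[i] - d is present, the progression extends to the left,
      # so starting at x[i] would not be maximal.
      if ((x[i] - d) %in% x) next
      # Extend to the right for as long as the next term is present.
      len <- 2
      while ((x[i] + len * d) %in% x) len <- len + 1
      if (len >= min_size) {
        out[[length(out) + 1]] <- x[i] + d * (0:(len - 1))
      }
    }
  }
  out
}

x <- c(1, 2, 3, 4, 5, 8, 9, 10, 12, 14, 15, 17, 18, 19)
res <- maximal_aps(x)
print(res[1:3])
```

On the question's data the result contains (1,2,3,4,5), (1,3,5) and (8,10,12,14), but not (8,10,12) or (2,3,4,5), because every progression is extended as far as possible in both directions. The cost grows roughly with the square of the number of distinct values, so this is only practical for short vectors.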

Answer 2:

Here's a potential solution - albeit a very ugly, sloppy one:

##
arithSeq <- function(x=nSeq, minSize=4){
  ##
  ## first differences; a run of equal differences is an arithmetic sequence
  dx <- diff(x,lag=1)
  Runs <- rle(dx)
  ##
  rLens <- Runs[[1]]
  rVals <- Runs[[2]]
  pStart <- c(
    rep(1,rLens[1]),
    rep(cumsum(1+rLens[-length(rLens)]),times=rLens[-1])
  )
  pEnd <- pStart + c(
    rep(rLens[1]-1, rLens[1]),
    rep(rLens[-1],times=rLens[-1])
  )
  pGrp <- rep(1:length(rLens),times=rLens)
  pLen <- rep(rLens, times=rLens)
  dAll <- data.frame(
    pStart=pStart,
    pEnd=pEnd,
    pGrp=pGrp,
    pLen=pLen,
    runVal=rep(rVals,rLens)
  )
  ##
  dSub <- subset(dAll, pLen >= minSize - 1)
  ##
  uVals <- unique(dSub$runVal)
  ##
  maxSub <- subset(dSub, runVal==uVals[1])
  maxLen <- max(maxSub$pLen)
  maxSub <- subset(maxSub, pLen==maxLen)
  ##
  if(length(uVals) > 1){
    for(i in 2:length(uVals)){
      iSub <- subset(dSub, runVal==uVals[i])
      iMaxLen <- max(iSub$pLen)
      iSub <- subset(iSub, pLen==iMaxLen)
      maxSub <- rbind(maxSub, iSub)
    }
    ##
  }
  ##
  deDup <- maxSub[!duplicated(maxSub),]
  seqStarts <- as.numeric(rownames(deDup))
  ## one list entry per retained run; seq_len() keeps the loop safe when nothing qualifies
  outList <- vector("list", nrow(deDup))
  for(i in seq_len(nrow(deDup))){
    outList[[i]] <- list(
      Sequence = x[seqStarts[i]:(seqStarts[i]+deDup[i,"pLen"])],
      Length=deDup[i,"pLen"]+1,
      StartPosition=seqStarts[i],
      EndPosition=seqStarts[i]+deDup[i,"pLen"])
  }
  ##
  return(outList)
  ##
}
##

There are definitely things that can be improved in this function. For instance, I made a mistake somewhere in the calculation of pStart and pEnd, the start and end indices of a given arithmetic sequence. It just so happened that the true start positions of such sequences are given by the row names of one of the intermediate data.frames, so I used those instead, which is a hacky sort of solution. Anyway, the function accepts a numeric vector x and a minimum length parameter, minSize, and returns a list containing information about the sequences meeting the criteria you outlined above.
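As a rough illustration of the run-length idea behind the function, the start and end indices can also be derived directly from the run lengths with cumsum, without relying on row names. This is a simplified sketch, not a replacement for arithSeq: it only looks at adjacent elements of x and only reports the single longest run. The names starts, ends, runs_df and longest are made up for this example.

```r
# Same vector as nSeq in the example that follows.
x <- c(1:10, 12, 33, 13:17, 16:26)

runs <- rle(diff(x))

# Run k starts right after the differences consumed by runs 1..k-1.
starts <- cumsum(c(1, head(runs$lengths, -1)))
# A run of k equal differences spans k + 1 elements of x.
ends <- starts + runs$lengths

runs_df <- data.frame(start = starts, end = ends,
                      diff = runs$values, n = runs$lengths + 1)

longest <- runs_df[which.max(runs_df$n), ]
print(longest)
print(x[longest$start:longest$end])
```

For this vector the longest run starts at position 18, ends at position 28 and contains the 11 values 16 to 26.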

set.seed(1234)
lSeq <- sample(1:25,100000,replace=TRUE)
nSeq <- c(1:10,12,33,13:17,16:26)
##
> arithSeq(nSeq)
[[1]]
[[1]]$Sequence
 [1] 16 17 18 19 20 21 22 23 24 25 26

[[1]]$Length
[1] 11

[[1]]$StartPosition
[1] 18

[[1]]$EndPosition
[1] 28
##
> arithSeq(x=lSeq,minSize=5)
[[1]]
[[1]]$Sequence
[1] 13 16 19 22 25

[[1]]$Length
[1] 5

[[1]]$StartPosition
[1] 12760

[[1]]$EndPosition
[1] 12764


[[2]]
[[2]]$Sequence
[1] 11 13 15 17 19

[[2]]$Length
[1] 5

[[2]]$StartPosition
[1] 37988

[[2]]$EndPosition
[1] 37992

Like I said, it's sloppy and inelegant, but it should get you started.

Source: https://stackoverflow.com/questions/24851973/arithmetic-progression-series-in-r

Tags

pattern-matching