Equal frequency discretization in R

南方客 2020-12-17 02:56

I'm having trouble finding a function in R that performs equal-frequency discretization. I stumbled on the 'infotheo' package, but after some testing I found that the al

8 answers
  •  野趣味 (original poster)
    2020-12-17 03:49

    EDIT: given your real goal, why not simply do this (corrected):

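    EqualFreq2 <- function(x,n){
        nx <- length(x)
        nrepl <- floor(nx/n)                  # base number of values per bin
        nplus <- sample(1:n,nx - nrepl*n)     # randomly chosen bins that get one extra value
        nrep <- rep(nrepl,n)
        nrep[nplus] <- nrepl+1
        x[order(x)] <- rep(seq.int(n),nrep)   # assign bin labels in sorted order
        x
    }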
    

    This returns a vector indicating which bin each value belongs to. Because the same value can end up in two adjacent bins, no bin limits can be defined. You can still inspect the result:

    x <- rpois(50,5)
    y <- EqualFreq2(x,15)
    table(y)
    split(x,y)
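
    To make the point about missing bin limits concrete, here is a minimal sketch that compares the value range of each bin. The name bin_ranges and the seed are assumptions introduced for illustration, and EqualFreq2 is repeated only so the snippet runs on its own:

    ```r
    # EqualFreq2 as defined in the answer, repeated to keep the snippet self-contained
    EqualFreq2 <- function(x, n) {
        nx <- length(x)
        nrepl <- floor(nx / n)
        nplus <- sample(1:n, nx - nrepl * n)
        nrep <- rep(nrepl, n)
        nrep[nplus] <- nrepl + 1
        x[order(x)] <- rep(seq.int(n), nrep)
        x
    }

    set.seed(1)  # arbitrary seed, used only for reproducibility
    x <- rpois(50, 5)
    y <- EqualFreq2(x, 15)

    # Row 1 holds the minimum and row 2 the maximum of the original values in each bin
    bin_ranges <- sapply(split(x, y), range)
    bin_ranges

    # Bin sizes differ by at most one
    diff(range(table(y)))
    ```

    With count data like this, the maximum of one bin will typically equal the minimum of the next, which is why the bins cannot be described by cut points.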
    

    Original answer:

    You can simply use cut() for this:

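    EqualFreq <-function(x,n,include.lowest=TRUE,...){
        nx <- length(x)
        # positions of the break points in the sorted data
        id <- round(c(1,(1:(n-1))*(nx/n),nx))
    
        breaks <- sort(x)[id]
        # tied values can yield identical breaks, which cut() cannot use
        if( sum(duplicated(breaks))>0 ) stop("n is too large.")
    
        cut(x,breaks,include.lowest=include.lowest,...)
    
    }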
    

    This gives:

    set.seed(12345)
    x <- rnorm(50)
    table(EqualFreq(x,5))
    
     [-2.38,-0.886] (-0.886,-0.116]  (-0.116,0.586]   (0.586,0.937]     (0.937,2.2] 
                 10              10              10              10              10 
    
    x <- rpois(50,5)
    table(EqualFreq(x,5))
    
     [1,3]  (3,5]  (5,6]  (6,7] (7,11] 
        10     13     11      6     10 
    

    As you can see, for discrete data exactly equal bin counts are impossible in most cases, because tied values cannot be split across a break point. This method still gives a close approximation.
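
    For comparison, a common alternative (not part of the answer above) takes the break points from quantile() rather than from positions in the sorted data. This is a minimal sketch, and EqualFreqQ is a hypothetical name introduced here. It suits continuous data, while heavily tied data can still produce duplicated breaks:

    ```r
    EqualFreqQ <- function(x, n, include.lowest = TRUE, ...) {
        # n + 1 evenly spaced probabilities define n bins
        breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1), names = FALSE)
        # duplicated breaks mean the data has too many ties for n bins
        if (anyDuplicated(breaks)) stop("n is too large.")
        cut(x, breaks, include.lowest = include.lowest, ...)
    }

    set.seed(12345)
    x <- rnorm(50)
    counts <- table(EqualFreqQ(x, 5))
    counts
    ```

    The bin edges here are interpolated between data points, so they generally differ from the edges produced by EqualFreq even when the counts agree.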
