Equal frequency discretization in R

前端未结


I'm having trouble finding a function in R that performs equal-frequency discretization. I stumbled on the 'infotheo' package, but after some testing I found that the al

8 Answers

天命终不由人

2020-12-17 03:41

How about Hmisc::cut2()?

a <- rnorm(50)
table(Hmisc::cut2(a, m = 10))

[-2.2020,-0.7710) [-0.7710,-0.2352) [-0.2352, 0.0997) [ 0.0997, 0.9775) 
               10                10                10                10 
[ 0.9775, 2.5677] 
               10


南方客

2020-12-17 03:41

We can use the cutr package with what = "rough"; the look of the labels can be customized to taste:

# devtools::install_github("moodymudskipper/cutr")
library(cutr)
smart_cut(c(1, 3, 2, 1, 2, 2), 2, "rough", brackets = NULL, sep="-")
# [1] 1-2 2-3 1-2 1-2 2-3 2-3
# Levels: 1-2 < 2-3


野趣味

2020-12-17 03:49

EDIT: given your real goal, why not simply do this (corrected):

EqualFreq2 <- function(x,n){
    nx <- length(x)
    nrepl <- floor(nx/n)                  # base number of values per bin
    nplus <- sample(1:n,nx - nrepl*n)     # bins that receive one extra value
    nrep <- rep(nrepl,n)
    nrep[nplus] <- nrepl+1
    x[order(x)] <- rep(seq.int(n),nrep)   # replace each value by its bin index
    x
}

This returns a vector indicating which bin each value belongs to. Because tied values can be split across adjacent bins, the bin limits cannot be defined unambiguously. But you can do:

x <- rpois(50,5)
y <- EqualFreq2(x,15)
table(y)
split(x,y)
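To illustrate why bin limits become ambiguous, here is a minimal sketch on a small fixed vector (the data and the two-bin split are assumptions chosen for illustration, not taken from the answer). It assigns bins by sorted position, as EqualFreq2 does, and then inspects the range of each bin:

```r
# Six values with ties, split into two bins of three by sorted position
x <- c(2, 1, 2, 3, 2, 2)
bin <- integer(length(x))
bin[order(x)] <- rep(1:2, each = 3)

# Range of the original values that ended up in each bin
ranges <- tapply(x, bin, range)
print(ranges)
```

The value 2 is both the upper end of the first bin and the lower end of the second, so no single cut point separates the two bins.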

Original answer:

You can simply use cut() for this:

EqualFreq <- function(x,n,include.lowest=TRUE,...){
    nx <- length(x)
    # positions of the break points in the sorted vector
    id <- round(c(1,(1:(n-1))*(nx/n),nx))

    breaks <- sort(x)[id]
    if (any(duplicated(breaks))) stop("n is too large.")

    cut(x,breaks,include.lowest=include.lowest,...)

}

Which gives :

set.seed(12345)
x <- rnorm(50)
table(EqualFreq(x,5))

 [-2.38,-0.886] (-0.886,-0.116]  (-0.116,0.586]   (0.586,0.937]     (0.937,2.2] 
             10              10              10              10              10 

x <- rpois(50,5)
table(EqualFreq(x,5))

 [1,3]  (3,5]  (5,6]  (6,7] (7,11] 
    10     13     11      6     10

As you can see, for discrete data a perfectly equal binning is impossible in most cases because of ties, but this method gives you the best binning available.
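The effect of ties can also be seen with base R's quantile() and cut(). This is a minimal sketch, not the method used above: `quantile_bins` is a hypothetical helper, it assumes quantile()'s default interpolation, and it collapses duplicated breaks with unique() instead of stopping with an error.

```r
# Hypothetical helper: cut x at its empirical quantiles.
# Duplicated breaks (caused by ties) are merged, which yields fewer bins.
quantile_bins <- function(x, n) {
  brks <- quantile(x, probs = seq(0, 1, length.out = n + 1), names = FALSE)
  cut(x, breaks = unique(brks), include.lowest = TRUE)
}

# Distinct values: four bins with equal counts
counts_distinct <- table(quantile_bins(1:20, 4))

# Heavily tied values: two breaks coincide, so only three unequal bins remain
counts_tied <- table(quantile_bins(c(1, 1, 1, 1, 2, 3, 4, 5), 4))

print(counts_distinct)
print(counts_tied)
```

Merging duplicated breaks keeps the call from failing, at the cost of returning fewer bins than requested.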


北海茫月

2020-12-17 03:49

Here is a function that handles the error "'breaks' are not unique" and automatically selects the n_bins value closest to the one you requested.

equal_freq <- function(var, n_bins)
{
  require(ggplot2)  # provides cut_number()

  n_bins_orig <- n_bins

  # cut_number() fails when the quantile breaks are not unique;
  # catch the error and retry with one bin fewer until it succeeds.
  # Checking the condition class avoids depending on the exact message text.
  res <- tryCatch(cut_number(var, n = n_bins), error = function(e) e)
  while (inherits(res, "error") && n_bins > 1)
  {
    n_bins <- n_bins - 1
    res <- tryCatch(cut_number(var, n = n_bins), error = function(e) e)
  }
  if (n_bins_orig != n_bins)
    warning(sprintf("It's not possible to calculate with n_bins=%s, setting n_bins in: %s.", n_bins_orig, n_bins))

  return(res)
}

Example:

equal_freq(mtcars$carb, 10)

This returns the binned variable along with the following warning:

It's not possible to calculate with n_bins=10, setting n_bins in: 5.
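If you would rather avoid the ggplot2 dependency, the same fallback idea can be sketched in base R. This is an illustrative variant under stated assumptions, not the function above: `equal_freq_base` is a hypothetical name, and it uses quantile() with its default settings to compute the breaks.

```r
# Hypothetical base-R variant: lower n_bins until the quantile breaks are unique
equal_freq_base <- function(x, n_bins) {
  repeat {
    brks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1), names = FALSE)
    if (anyDuplicated(brks) == 0 || n_bins == 1) break
    n_bins <- n_bins - 1
  }
  # Note: a constant vector still has duplicated breaks at n_bins = 1,
  # in which case cut() raises its usual error.
  cut(x, breaks = brks, include.lowest = TRUE)
}

# Four bins are requested, but the ties at 1 force a fallback to fewer bins
res <- equal_freq_base(c(1, 1, 1, 1, 2, 3, 4, 5), 4)
print(table(res))
```

Unlike the ggplot2 version, this sketch does not emit a warning when it reduces the number of bins; call nlevels() on the result to see how many bins were actually used.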


情书的邮戳

2020-12-17 03:49

Here is a one-line solution inspired by @Joris' answer:

x <- rpois(50,5)
binSize <- 5  # the number of bins, despite the name
desiredFrequency <- floor(length(x)/binSize)  # observations per bin
split(sort(x), rep(1:binSize, rep(desiredFrequency, binSize)))
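The one-liner above assumes that length(x) is a multiple of the number of bins. When it is not, a rank-based assignment is one possible alternative. The sketch below is an assumption-laden illustration (`rank_bins` is a hypothetical helper, and ties are broken by position via ties.method = "first"):

```r
# Hypothetical helper: map each value's rank to a bin index in 1..n_bins.
# Ties are broken by position, so bin sizes differ by at most one.
rank_bins <- function(x, n_bins) {
  ceiling(rank(x, ties.method = "first") * n_bins / length(x))
}

x <- c(5, 3, 3, 8, 1, 9, 2)   # 7 values, not divisible by 3
bins <- rank_bins(x, 3)
print(table(bins))
print(split(x, bins))
```

As with the other position-based approaches in this thread, tied values may land in different bins.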


闹比i

2020-12-17 03:50
The classInt package was created "for choosing univariate class intervals for mapping or other graphics purposes". You can just do:
```
dataset <- c(1,3,2,1,2,2) 

library(classInt)
classIntervals(dataset, 2, style = 'quantile')
```
where 2 is the number of bins you want and the quantile style provides quantile breaks. Several styles are available for this function: "fixed", "sd", "equal", "pretty", "quantile", "kmeans", "hclust", "bclust", "fisher", or "jenks". Check the package documentation for more information.