Generate distribution given percentile ranks

后端 未结 3 638
旧时难觅i
旧时难觅i 2021-01-07 15:21

I\'d like to generate a distribution in R given the following score and percentile ranks.

x <- 1:10
PercRank <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)         


        
3条回答
  •  一个人的身影
    2021-01-07 16:14

    From Wikipedia:

    The percentile rank of a score is the percentage of scores in its frequency distribution that are the same or lower than it.

    In order to illustrate this, let's create a distribution, say, a normal distribution, with mean=2 and sd=2, so that we can test (our code) later.

    # 1000 samples from normal(2,2)
    x1 <- rnorm(1000, mean=2, sd=2)
    

    Now, let's take the same percentile rank you've mentioned in your post. Let's divide it by 100 so that they represent cumulative probabilities.

    cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)/100
    

    And what are the values (scores) corresponding to these percentiles?

    # generating values similar to your x.
    x <- c(t(quantile(x1, cum.p)))
    > x
     [1] -2.1870396 -1.4707273 -1.1535935 -0.8265444 -0.2888791  
             0.2781699  0.5893503  0.8396868  1.4222489  2.1519328
    

    This means that 1% of the data is lesser than -2.18. 7% of the data is lesser than -1.47 etc... Now, we have the x and cum.p (equivalent to your PercRank). Let's forget x1 and the fact that this should be a normal distribution. To find out what distribution it could be, let's get actual probabilities from the cumulative probabilities by using diff that takes the difference between nth and (n-1)th element.

    prob <- c( cum.p[1], diff(cum.p), .01)
    > prob
    # [1] 0.01 0.06 0.05 0.11 0.18 0.21 0.11 0.07 0.12 0.07 0.01
    

    Now, all we have to do is is to generate samples of size, say, 100 (could be any number), for each interval of x (x[1]:x[2], x[2]:x[3] ...) and then finally sample from this huge data as many number of points as you need (say, 10000), with probabilities mentioned above.

    This can be done by:

    freq <- 10000 # final output size that we want
    
    # Extreme values beyond x (to sample)
    init <- -(abs(min(x)) + 5) 
    fin  <- abs(max(x)) + 5
    
    ival <- c(init, x, fin) # generate the sequence to take pairs from
    len <- 100 # sequence of each pair
    
    s <- sapply(2:length(ival), function(i) {
        seq(ival[i-1], ival[i], length.out=len)
    })
    # sample from s, total of 10000 values with probabilities calculated above
    out <- sample(s, freq, prob=rep(prob, each=len), replace = T)
    

    Now, we have 10000 samples from the distribution. Let's look at how it is. It should resemble a normal distribution with mean = 2 and sd = 2.

    > hist(out)
    

    normal_dist

    > c(mean(out), sd(out))
    # [1] 1.954834 2.170683
    

    It is a normal distribution (from the histogram) with mean = 1.95 and sd = 2.17 (~ 2).

    Note: Some things what I've explained may have been roundabout and/or the code "may/may not" work with some other distributions. The point of this post was just to explain the concept with a simple example.

    Edit: In an attempt to clarify @Dwin's point, I tried the same code with x = 1:10 corresponding to OP's question, with the same code by replacing the value of x.

    cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)/100
    prob <- c( cum.p[1], diff(cum.p), .01)
    x <- 1:10
    
    freq <- 10000 # final output size that we want
    
    # Extreme values beyond x (to sample)
    init <- -(abs(min(x)) + 1) 
    fin  <- abs(max(x)) + 1
    
    ival <- c(init, x, fin) # generate the sequence to take pairs from
    len <- 100 # sequence of each pair
    
    s <- sapply(2:length(ival), function(i) {
        seq(ival[i-1], ival[i], length.out=len)
    })
    # sample from s, total of 10000 values with probabilities calculated above
    out <- sample(s, freq, prob=rep(prob, each=len), replace = T)
    
    > quantile(out, cum.p) # ~ => x = 1:10
    # 1%     7%    12%    23%    41%    62%    73%    80%    92%    99% 
    # 0.878  1.989  2.989  4.020  5.010  6.030  7.030  8.020  9.050 10.010 
    
    > hist(out)
    

    hist_OPs_data

提交回复
热议问题