Generate distribution given percentile ranks


旧时难觅i 2021-01-07 15:21

I'd like to generate a distribution in R given the following scores and percentile ranks.

x <- 1:10
PercRank <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)


      
      
        
3 answers

一个人的身影 (original poster) 2021-01-07 16:14
              

            
            
                        
From Wikipedia: 


  The percentile rank of a score is the percentage of scores in its frequency distribution that are the same or lower than it.
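
As a quick check of that definition, here is a minimal sketch using base R's ecdf. The scores vector is made up purely for illustration and is not part of the question.

```r
# Hypothetical scores, made up for illustration only
scores <- c(2, 4, 4, 5, 7, 9, 10, 10)

# ecdf() returns a function giving the proportion of values <= its argument
Fn <- ecdf(scores)

# Percentile rank of each score: percentage of scores that are the same or lower
perc_rank <- 100 * Fn(scores)
perc_rank
```

Tied scores (the two 4s, the two 10s) get the same percentile rank, as the definition implies.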


To illustrate this, let's create a known distribution, say a normal distribution with mean = 2 and sd = 2, so that we can test our code against it later.

# 1000 samples from normal(2,2)
x1 <- rnorm(1000, mean=2, sd=2)


Now take the percentile ranks from your post and divide them by 100 so that they represent cumulative probabilities.

cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)/100


And what are the values (scores) corresponding to these percentiles?

# generate values that play the role of your x; unname() drops the "1%", "7%", ... labels
x <- unname(quantile(x1, cum.p))
> x
 [1] -2.1870396 -1.4707273 -1.1535935 -0.8265444 -0.2888791  
         0.2781699  0.5893503  0.8396868  1.4222489  2.1519328


This means that 1% of the data is less than or equal to -2.18, 7% is less than or equal to -1.47, and so on. We now have x and cum.p (equivalent to your PercRank). Let's forget x1 and the fact that the data came from a normal distribution. To find out what distribution it could be, convert the cumulative probabilities into the probability of each interval using diff, which returns the difference between the nth and (n-1)th elements. The first value, cum.p[1], is the mass below the lowest score, and the last value is the mass above the highest score (1 - 0.99 = .01).

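# mass below the first score, between consecutive scores, and above the last score (.01 here)
prob <- c(cum.p[1], diff(cum.p), 1 - cum.p[length(cum.p)])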
> prob
# [1] 0.01 0.06 0.05 0.11 0.18 0.21 0.11 0.07 0.12 0.07 0.01
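
A small sanity check on this step, as a sketch: the interval probabilities should sum to 1, and cumsum should undo diff and give back cum.p. The tail term is written here as 1 - cum.p[length(cum.p)], which equals the .01 used above for these particular ranks.

```r
cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99) / 100

# Mass below the first score, between consecutive scores, and above the last
prob <- c(cum.p[1], diff(cum.p), 1 - cum.p[length(cum.p)])

# The interval probabilities must form a valid distribution
total <- sum(prob)

# cumsum() reverses diff(): the first 10 running totals recover cum.p
recovered <- cumsum(prob)[seq_along(cum.p)]

isTRUE(all.equal(total, 1))
isTRUE(all.equal(recovered, cum.p))
```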


Now all we have to do is generate a grid of, say, 100 points (any number works) over each interval of x (x[1] to x[2], x[2] to x[3], ...), plus one interval beyond each end, and then sample as many points as you need (say, 10000) from this pooled grid, weighting each point by the probability of its interval.

This can be done by:

freq <- 10000 # final output size that we want

# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 5) 
fin  <- abs(max(x)) + 5

ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # sequence of each pair

s <- sapply(2:length(ival), function(i) {
    seq(ival[i-1], ival[i], length.out=len)
})
# sample 10000 values from s; every grid point in an interval carries that interval's probability
out <- sample(s, freq, prob=rep(prob, each=len), replace = TRUE)
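
The grid-and-sample step above is roughly a discrete version of inverse-transform sampling from a piecewise-linear CDF. As an alternative sketch (not the method used in this answer), the same idea can be written with base R's approx and runif. The names lo, hi, p_knots and x_knots are introduced here for illustration, and placing the tail end points one unit beyond the data is an assumption, just as the padding in the code above is.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

x     <- 1:10
cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99) / 100

# Assumed end points for the two open-ended tails (one unit beyond the data)
lo <- min(x) - 1
hi <- max(x) + 1

# Piecewise-linear CDF: probabilities 0, cum.p, 1 at the points lo, x, hi
p_knots <- c(0, cum.p, 1)
x_knots <- c(lo, x, hi)

# Inverse-transform sampling: map uniform draws through the inverse CDF
u   <- runif(10000)
out <- approx(p_knots, x_knots, xout = u)$y

# The sample quantiles at cum.p should land close to x
round(quantile(out, cum.p), 2)
```

This avoids building the intermediate grid and produces continuous values, but it makes the same assumption as the original code: scores are spread uniformly within each interval.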


We now have 10000 samples from the reconstructed distribution. Let's see what it looks like. It should resemble a normal distribution with mean = 2 and sd = 2.

> hist(out)




> c(mean(out), sd(out))
# [1] 1.954834 2.170683


Judging from the histogram, it is approximately normal, with mean = 1.95 and sd = 2.17 (~ 2).

Note: Some of what I've explained may be roundabout, and the code may or may not work with other distributions. The point of this post is just to explain the concept with a simple example.

Edit: In an attempt to clarify @Dwin's point, I ran the same code with x = 1:10, as in the OP's question, changing only the value of x and the padding beyond its ends.

cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)/100
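# mass below the first score, between consecutive scores, and above the last score (.01 here)
prob <- c(cum.p[1], diff(cum.p), 1 - cum.p[length(cum.p)])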
x <- 1:10

freq <- 10000 # final output size that we want

# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 1) 
fin  <- abs(max(x)) + 1

ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # sequence of each pair

s <- sapply(2:length(ival), function(i) {
    seq(ival[i-1], ival[i], length.out=len)
})
# sample 10000 values from s; every grid point in an interval carries that interval's probability
out <- sample(s, freq, prob=rep(prob, each=len), replace = TRUE)

> quantile(out, cum.p) # should be approximately x = 1:10
# 1%     7%    12%    23%    41%    62%    73%    80%    92%    99% 
# 0.878  1.989  2.989  4.020  5.010  6.030  7.030  8.020  9.050 10.010 
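
The same check can be run in the other direction, as a sketch: compute the percentile rank of each score in the generated sample with ecdf and compare it with the OP's PercRank. The code below repeats the generation step so that it runs on its own. The seed is arbitrary, and rank_check is a name introduced here for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

x        <- 1:10
PercRank <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)
cum.p    <- PercRank / 100
prob     <- c(cum.p[1], diff(cum.p), 1 - cum.p[length(cum.p)])

# Same grid-and-sample construction as above
ival <- c(-(abs(min(x)) + 1), x, abs(max(x)) + 1)
len  <- 100
s <- sapply(2:length(ival), function(i) {
    seq(ival[i-1], ival[i], length.out=len)
})
out <- sample(s, 10000, prob=rep(prob, each=len), replace = TRUE)

# Percentile rank of each score within the generated sample
rank_check <- 100 * ecdf(out)(x)
round(rank_check)
```

The recovered ranks will differ slightly from PercRank because of sampling noise and because each interval's grid includes both of its end points.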

> hist(out)



    
             
                                                        
            
            
              
                