I'd like to generate a distribution in R given the following scores and percentile ranks.
x <- 1:10
PercRank <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)
I think you want the ecdf function, which is mentioned as the inverse of the quantile function on the ?quantile help page.
# construct your vector containing the data
PercRank <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)
# construct an empirical cumulative distribution function
# which is really just the `inverse` of `quantile`
Fn <- ecdf( PercRank )
# note that the `ecdf` function returns a function itself.
# calculate what fraction of `PercRank` is at or below these integers
Fn( 0 )
Fn( 1 )
Fn( 2 )
Fn( 3 )
Fn( 6 )
Fn( 7 )
Fn( 8 )
# re-construct your `x` vector using PercRank
Fn( PercRank ) * 10
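As a quick check of that inverse relationship (this is just an illustration I'm adding, using the type-1 quantile, which is defined as the inverse of the empirical CDF):
# feeding the ecdf values back into `quantile` with type = 1 should
# recover the original scores 1, 7, 12, ..., 99
quantile( PercRank, Fn( PercRank ), type = 1 )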
The following produces a dataset that has the features you specify. If you wanted more "randomness", you could subtract a random number within the range of each percentile span from the `rep` result inside the anonymous function (see the sketch after the output below):
> mapply( function(x,y) rep(x, each=y), (x), diff(c(PercRank, 100) ) )
[[1]]
[1] 1 1 1 1 1 1
[[2]]
[1] 2 2 2 2 2
[[3]]
[1] 3 3 3 3 3 3 3 3 3 3 3
[[4]]
[1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[[5]]
[1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
[[6]]
[1] 6 6 6 6 6 6 6 6 6 6 6
[[7]]
[1] 7 7 7 7 7 7 7
[[8]]
[1] 8 8 8 8 8 8 8 8 8 8 8 8
[[9]]
[1] 9 9 9 9 9 9 9
[[10]]
[1] 10
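A minimal sketch of that "randomness" idea (my addition; the uniform jitter and the seed are arbitrary choices, not part of the original suggestion):
# jitter each replicated score downward by a uniform draw so the values
# spread over (x-1, x] instead of sitting exactly on the integers
set.seed(1)
mapply( function(x,y) rep(x, each=y) - runif(y), (x), diff(c(PercRank, 100) ) )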
From Wikipedia:
The percentile rank of a score is the percentage of scores in its frequency distribution that are the same or lower than it.
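In R, that definition is just the mean of a logical comparison; for instance (a throwaway example, not the OP's data):
# percentile rank of the score 7 within a small set of scores
scores <- c(3, 5, 7, 7, 9)
100 * mean(scores <= 7)   # 80, i.e. 80% of the scores are <= 7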
In order to illustrate this, let's create a distribution, say, a normal distribution with mean=2 and sd=2, so that we can test our code later.
# 1000 samples from normal(2,2)
x1 <- rnorm(1000, mean=2, sd=2)
Now, let's take the same percentile ranks you've mentioned in your post and divide them by 100 so that they represent cumulative probabilities.
cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)/100
And what are the values (scores) corresponding to these percentiles?
# generating values similar to your x.
x <- c(t(quantile(x1, cum.p)))
> x
 [1] -2.1870396 -1.4707273 -1.1535935 -0.8265444 -0.2888791  0.2781699  0.5893503  0.8396868  1.4222489  2.1519328
This means that 1% of the data is less than -2.18, 7% of the data is less than -1.47, and so on. Now we have x and cum.p (equivalent to your PercRank). Let's forget x1 and the fact that this should be a normal distribution. To find out what distribution it could be, let's get the actual probabilities from the cumulative probabilities by using diff, which takes the difference between the nth and (n-1)th elements.
prob <- c( cum.p[1], diff(cum.p), .01)
> prob
# [1] 0.01 0.06 0.05 0.11 0.18 0.21 0.11 0.07 0.12 0.07 0.01
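A quick sanity check I'll add here: the interval probabilities should sum to 1, since the trailing 0.01 just covers the tail above the 99th percentile.
sum(prob)
# [1] 1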
Now, all we have to do is generate samples of size, say, 100 (it could be any number) for each interval of x (x[1]:x[2], x[2]:x[3], ...) and then finally sample from this large pool as many points as you need (say, 10000), with the probabilities calculated above.
This can be done by:
freq <- 10000 # final output size that we want
# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 5)
fin <- abs(max(x)) + 5
ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # length of the sequence for each pair
s <- sapply(2:length(ival), function(i) {
seq(ival[i-1], ival[i], length.out=len)
})
# sample from s, total of 10000 values with probabilities calculated above
out <- sample(s, freq, prob=rep(prob, each=len), replace = T)
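One detail worth noting (my comment, not from the original answer): s is a matrix with one column per interval, and sample() flattens it column by column, which is why rep(prob, each=len) supplies exactly one weight per element:
dim(s)                          # 100 x 11: len rows, one column per interval
length(rep(prob, each = len))   # 1100, one probability weight per element of s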
Now we have 10000 samples from the distribution. Let's see what it looks like. It should resemble a normal distribution with mean = 2 and sd = 2.
> hist(out)
> c(mean(out), sd(out))
# [1] 1.954834 2.170683
It is a normal distribution (from the histogram) with mean = 1.95 and sd = 2.17 (~ 2).
Note: Some of what I've explained may be roundabout, and the code may or may not work with other distributions. The point of this post was just to explain the concept with a simple example.
Edit: In an attempt to clarify @Dwin's point, I ran the same code with x = 1:10, corresponding to the OP's question, replacing only the value of x.
cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)/100
prob <- c( cum.p[1], diff(cum.p), .01)
x <- 1:10
freq <- 10000 # final output size that we want
# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 1)
fin <- abs(max(x)) + 1
ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # length of the sequence for each pair
s <- sapply(2:length(ival), function(i) {
seq(ival[i-1], ival[i], length.out=len)
})
# sample from s, total of 10000 values with probabilities calculated above
out <- sample(s, freq, prob=rep(prob, each=len), replace = T)
> quantile(out, cum.p) # ~ => x = 1:10
# 1% 7% 12% 23% 41% 62% 73% 80% 92% 99%
# 0.878 1.989 2.989 4.020 5.010 6.030 7.030 8.020 9.050 10.010
> hist(out)