Fitting empirical distribution to theoretical ones with Scipy (Python)?

前端未结
关注
 9  815
醉话见心 2020-11-22 05:28
INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g.[0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47,47,47,...]<

      
      
        
          9条回答        

        
                    
            
            
                         
                
              
              
                
                   自闭症患者
                                             
                
                
                (楼主)
            
              
              
                2020-11-22 05:50
              

            
            
                        
While many of the above answers are completely valid, no one seems to answer your question completely, specifically the part:

I don't know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model.

The parametric approach
This is the process you're describing of using some theoretical distribution and fitting the parameters to your data and there's some excellent answers how to do this.
The non-parametric approach
However, it's also possible to use a non-parametric approach to your problem, which means you do not assume any underlying distribution at all.
By using the so-called Empirical distribution function which equals:
Fn(x)= SUM( I[X<=x] ) / n. So the proportion of values below x.
As was pointed out in one of the above answers is that what you're interested in is the inverse CDF (cumulative distribution function), which is equal to 1-F(x)
It can be shown that the empirical distribution function will converge to whatever 'true' CDF that generated your data.
Furthermore, it is straightforward to construct a 1-alpha confidence interval by:
L(X) = max{Fn(x)-en, 0}
U(X) = min{Fn(x)+en, 0}
en = sqrt( (1/2n)*log(2/alpha)

Then P( L(X) <= F(X) <= U(X) ) >= 1-alpha for all x.
I'm quite surprised that PierrOz answer has 0 votes, while it's a completely valid answer to the question using a non-parametric approach to estimating F(x).
Note that the issue you mention of P(X>=x)=0 for any x>47 is simply a personal preference that might lead you to chose the parametric approach above the non-parametric approach. Both approaches however are completely valid solutions to your problem.
For more details and proofs of the above statements I would recommend having a look at
'All of Statistics: A Concise Course in Statistical Inference by Larry A. Wasserman'. An excellent book on both parametric and non-parametric inference.
EDIT:
Since you specifically asked for some python examples it can be done using numpy:
import numpy as np

def empirical_cdf(data, x):
    return np.sum(x<=data)/len(data)

def p_value(data, x):
    return 1-empirical_cdf(data, x)

# Generate some data for demonstration purposes
data = np.floor(np.random.uniform(low=0, high=48, size=30000))

print(empirical_cdf(data, 20))
print(p_value(data, 20)) # This is the value you're interested in

    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它9个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复