How to compute the probability of a value given a list of samples from a distribution in Python?


春和景丽 2020-12-12 22:42

Not sure if this belongs in statistics, but I am trying to use Python to achieve this. I essentially just have a list of integers:

data = [300,244,543,1011,3


      
      
        
3 answers

谎友^ (OP) 2020-12-12 23:40
              

            
            
                        
OK, I offer this as a starting point, but estimating densities is a very broad topic. For your case involving the number of characters in a sequence, we can model this from a straightforward frequentist perspective using empirical probability. Here, probability is essentially a generalization of the concept of percentage. In our model, the sample space is discrete: the positive integers. You then simply count the occurrences of each value and divide by the total number of observations to get your estimate of its probability. Anywhere we have zero observations, our estimate for the probability is zero.

>>> samples = [1,1,2,3,2,2,7,8,3,4,1,1,2,6,5,4,8,9,4,3]
>>> from collections import Counter
>>> counts = Counter(samples)
>>> counts
Counter({1: 4, 2: 4, 3: 3, 4: 3, 8: 2, 5: 1, 6: 1, 7: 1, 9: 1})
>>> total = sum(counts.values())
>>> total
20
>>> probability_mass = {k:v/total for k,v in counts.items()}
>>> probability_mass
{1: 0.2, 2: 0.2, 3: 0.15, 4: 0.15, 5: 0.05, 6: 0.05, 7: 0.05, 8: 0.1, 9: 0.05}
>>> probability_mass.get(2,0)
0.2
>>> probability_mass.get(12,0)
0
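
If you need to look up several values, the counting steps above can be wrapped in a small helper. This is only a sketch: the name `empirical_pmf` is hypothetical (it is not part of any library), and it assumes the samples are hashable values such as integers.

```python
from collections import Counter


def empirical_pmf(samples):
    """Return a function p(x) giving the relative frequency of x in samples.

    Hypothetical helper for illustration; not a library function.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    total = len(samples)

    def pmf(x):
        # Values that were never observed get an estimated probability of zero.
        return counts.get(x, 0) / total

    return pmf


samples = [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1, 1, 2, 6, 5, 4, 8, 9, 4, 3]
pmf = empirical_pmf(samples)
print(pmf(2))   # relative frequency of the value 2
print(pmf(12))  # 12 never occurs in the samples
```

Returning a closure keeps the counts computed once, so repeated lookups are cheap.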


Now, for your timing data, it is more natural to model this as a continuous distribution. Instead of using a parametric approach, where you assume that your data follows some distribution and then fit that distribution to your data, you should take a non-parametric approach. One straightforward way is to use a kernel density estimate. You can think of this as a way of smoothing a histogram to give you a continuous probability density function. Note that the value of a density at a point is not itself a probability; probabilities come from integrating the density over an interval. There are several libraries available. Perhaps the most straightforward for univariate data is scipy's:

>>> import scipy.stats
>>> kde = scipy.stats.gaussian_kde(samples)
>>> kde.pdf(2)
array([ 0.15086911])
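
To see what the estimate does under the hood, here is a minimal hand-rolled sketch of a one-dimensional Gaussian KDE using only the standard library. It assumes the bandwidth is chosen by Scott's rule (sample standard deviation times n**(-1/5)), which as far as I know is scipy's default for `gaussian_kde`; the function name `kde_pdf` is made up for illustration.

```python
import math
import statistics


def kde_pdf(samples, x):
    """Gaussian kernel density estimate at x (hypothetical helper, 1-D only)."""
    n = len(samples)
    # Assumption: Scott's rule bandwidth, h = s * n**(-1/5),
    # where s is the sample standard deviation (ddof=1).
    h = statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    # Sum one Gaussian bump per sample, each centred on that sample.
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)


samples = [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1, 1, 2, 6, 5, 4, 8, 9, 4, 3]
density_at_2 = kde_pdf(samples, 2)
print(density_at_2)
```

For the samples above this comes out at roughly 0.151, in line with the scipy result.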


To get the probability of an observation falling in some interval (here, between 1 and 2):

>>> kde.integrate_box_1d(1,2)
0.13855869478828692
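
The interval probability can be sketched by hand as well: for a Gaussian kernel, the integral of the estimate over [a, b] is the average of normal CDF differences centred on each sample. The helper names `normal_cdf` and `kde_interval_probability` are hypothetical, and the bandwidth again assumes Scott's rule.

```python
import math
import statistics


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kde_interval_probability(samples, a, b):
    """Probability mass a Gaussian KDE assigns to [a, b] (illustrative sketch)."""
    n = len(samples)
    # Assumption: Scott's rule bandwidth, as in scipy's default setting.
    h = statistics.stdev(samples) * n ** (-1 / 5)
    # Each sample contributes the mass its own Gaussian bump puts on [a, b].
    return sum(
        normal_cdf((b - xi) / h) - normal_cdf((a - xi) / h) for xi in samples
    ) / n


samples = [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1, 1, 2, 6, 5, 4, 8, 9, 4, 3]
p = kde_interval_probability(samples, 1, 2)
print(p)
```

The result should land near the 0.1386 shown above. Shrinking the interval to a single point gives 0, which is why for continuous data you ask about intervals rather than exact values.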

    
             
                                                        
            
            
              
                