How can I get descriptive statistics of a NumPy array?

前端未结

关注

 3  2159

臣服心动 2021-01-01 09:50

I use the following code to create a numpy-ndarray. The file has 9 columns. I explicitly type each column:

dataset = np.genfromtxt(\"data.csv\", delimiter=\"


      
      
        
          3条回答        

        
                    
            
            
                         
                
              
              
                
                   长情又很酷
                                             
                
                
                (楼主)
            
              
              
                2021-01-01 10:16
              

            
            
                        
The question of how to deal with mixed data from genfromtxt comes up often.  People expect a 2d array, and instead get a 1d that they can't index by column.  That's because they get a structured array - with different dtype for each column.

All the examples in the genfromtxt doc show this:

>>> s = StringIO("1,1.3,abcde")
>>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
... ('mystring','S5')], delimiter=",")
>>> data
array((1, 1.3, 'abcde'),
      dtype=[('myint', '


But let me demonstrate how to access this kind of data

In [361]: txt=b"""A, 1,2,3
     ...: B,4,5,6
     ...: """
In [362]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,int,float,int'))
In [363]: data
Out[363]: 
array([(b'A', 1, 2.0, 3), (b'B', 4, 5.0, 6)], 
      dtype=[('f0', 'S1'), ('f1', '


So my array has 2 records (check the shape), which are displayed as tuples in a list.

You access fields by name, not by column number (do I need to add a structured array documentation link?)

In [364]: data['f0']
Out[364]: 
array([b'A', b'B'], 
      dtype='|S1')
In [365]: data['f1']
Out[365]: array([1, 4])


In a case like this might be more useful if I choose a dtype with 'subarrays'.  This a more advanced dtype topic

In [367]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,(3)float'))
In [368]: data
Out[368]: 
array([(b'A', [1.0, 2.0, 3.0]), (b'B', [4.0, 5.0, 6.0])], 
      dtype=[('f0', 'S1'), ('f1', '


The character column is still  loaded as S1, but the numbers are now in a 3 column array.  Note that they are all float (or int).  

In [371]: from scipy import stats
In [372]: stats.describe(data['f1'])
Out[372]: DescribeResult(nobs=2, 
   minmax=(array([ 1.,  2.,  3.]), array([ 4.,  5.,  6.])),
   mean=array([ 2.5,  3.5,  4.5]), 
   variance=array([ 4.5,  4.5,  4.5]), 
   skewness=array([ 0.,  0.,  0.]), 
   kurtosis=array([-2., -2., -2.]))

    
             
                                                        
            

            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它3个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          

                              			
        

        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复