How can I get descriptive statistics of a NumPy array?

臣服心动 2021-01-01 09:50

I use the following code to create a NumPy ndarray. The file has 9 columns. I explicitly set the type of each column:

dataset = np.genfromtxt("data.csv", delimiter="


        
3 Answers
  •  长情又很酷
    2021-01-01 10:16

    The question of how to handle mixed-type data from genfromtxt comes up often. People expect a 2-D array and instead get a 1-D array that they can't index by column. That's because they get a structured array, with a different dtype for each column.

    All the examples in the genfromtxt doc show this:

    >>> s = StringIO("1,1.3,abcde")
    >>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
    ... ('mystring','S5')], delimiter=",")
    >>> data
    array((1, 1.3, 'abcde'),
          dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', 'S5')])
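
To make the 1-D versus 2-D distinction concrete, here is a minimal sketch. The two-row input and the variable names are made up for illustration, and the explicit `i8`/`f8` type codes are an assumption chosen so the result does not depend on the platform's default integer size.

```python
import numpy as np

# Hypothetical two-row input: one text column, three numeric columns
lines = [b"A,1,2,3", b"B,4,5,6"]

# Mixed dtypes: genfromtxt returns a 1-D structured array (one record per row)
mixed = np.genfromtxt(lines, delimiter=",", dtype="S1,i8,f8,i8")

# Numeric columns only, default float dtype: an ordinary 2-D array
numeric_only = np.genfromtxt(lines, delimiter=",", usecols=(1, 2, 3))

print(mixed.ndim, mixed.shape)
print(numeric_only.ndim, numeric_only.shape)

# Column indexing needs two axes, so it fails on the structured array
try:
    mixed[:, 1]
    column_indexing_ok = True
except IndexError:
    column_indexing_ok = False
print(column_indexing_ok)
```

The structured array has a single axis, which is why `mixed[:, 1]` raises an `IndexError` while `numeric_only[:, 1]` works.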

    But let me demonstrate how to access this kind of data:

    In [361]: txt=b"""A, 1,2,3
         ...: B,4,5,6
         ...: """
    In [362]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,int,float,int'))
    In [363]: data
    Out[363]: 
    array([(b'A', 1, 2.0, 3), (b'B', 4, 5.0, 6)], 
          dtype=[('f0', 'S1'), ('f1', '

    So my array has 2 records (check the shape), which are displayed as tuples in a list.

    You access fields by name, not by column number (do I need to add a structured array documentation link?)

    In [364]: data['f0']
    Out[364]: 
    array([b'A', b'B'], 
          dtype='|S1')
    In [365]: data['f1']
    Out[365]: array([1, 4])
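
Since each field comes back as an ordinary 1-D array, one way to get per-column statistics is to loop over the field names and skip the non-numeric ones. This is only a sketch: the `summary` dictionary and its keys are names invented here, and the `i8`/`f8` type codes are an assumption used in place of `int`/`float` to keep the dtypes explicit.

```python
import numpy as np

txt = b"""A, 1,2,3
B,4,5,6
"""
data = np.genfromtxt(txt.splitlines(), delimiter=",", dtype="S1,i8,f8,i8")

# Field names assigned by genfromtxt when none are given
print(data.dtype.names)

# Collect basic statistics for every numeric field
summary = {}
for name in data.dtype.names:
    column = data[name]
    if np.issubdtype(column.dtype, np.number):
        summary[name] = {
            "min": column.min(),
            "max": column.max(),
            "mean": column.mean(),
            "var": column.var(ddof=1),  # sample variance
        }

for name, values in summary.items():
    print(name, values)
```

The string field `f0` is skipped because its dtype is not a subtype of `np.number`.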
    

    In a case like this, it might be more useful to choose a dtype with 'subarrays'. This is a more advanced dtype topic:

    In [367]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,(3)float'))
    In [368]: data
    Out[368]: 
    array([(b'A', [1.0, 2.0, 3.0]), (b'B', [4.0, 5.0, 6.0])], 
          dtype=[('f0', 'S1'), ('f1', '<f8', (3,))])

    The character column is still loaded as S1, but the numbers are now in a 3-column array. Note that they must all share one dtype, here float (they could all be int instead).

    In [371]: from scipy import stats
    In [372]: stats.describe(data['f1'])
    Out[372]: DescribeResult(nobs=2, 
       minmax=(array([ 1.,  2.,  3.]), array([ 4.,  5.,  6.])),
       mean=array([ 2.5,  3.5,  4.5]), 
       variance=array([ 4.5,  4.5,  4.5]), 
       skewness=array([ 0.,  0.,  0.]), 
       kurtosis=array([-2., -2., -2.]))
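
If the numeric columns were loaded as separate fields rather than as a subarray, a possible alternative is to convert them to a plain 2-D array and compute the same quantities with NumPy alone. The sketch below assumes NumPy 1.16 or later, where `numpy.lib.recfunctions.structured_to_unstructured` is available as far as I know; the `stats_by_column` dictionary is a name invented for this example.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

txt = b"""A, 1,2,3
B,4,5,6
"""
data = np.genfromtxt(txt.splitlines(), delimiter=",", dtype="S1,i8,f8,i8")

# Select the numeric fields and cast them to a common dtype in a 2-D array
numeric = rfn.structured_to_unstructured(data[["f1", "f2", "f3"]])
print(numeric.shape, numeric.dtype)

# Column-wise statistics, mirroring part of what stats.describe reports
stats_by_column = {
    "nobs": numeric.shape[0],
    "min": numeric.min(axis=0),
    "max": numeric.max(axis=0),
    "mean": numeric.mean(axis=0),
    "variance": numeric.var(axis=0, ddof=1),  # sample variance
}
for key, value in stats_by_column.items():
    print(key, value)
```

The mixed int and float fields are promoted to float in the 2-D result, so the integer columns lose their original dtype.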
    
