Calculate summary statistics of columns in dataframe

萌比男神i 2020-12-07 20:38

I have a dataframe of the following form (for example):

shopper_num,is_martian,number_of_items,count_pineapples,birth_country,tranpsortation_method
1,FALSE,0,         


        
3 Answers
  • 2020-12-07 20:55

    df.describe() may give you everything you want. Otherwise, you can perform aggregations using groupby and pass a list of functions to agg: http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once

    In [43]:
    
    df.describe()
    
    Out[43]:
    
           shopper_num is_martian  number_of_items  count_pineapples
    count      14.0000         14        14.000000                14
    mean        7.5000          0         3.357143                 0
    std         4.1833          0         6.452276                 0
    min         1.0000      False         0.000000                 0
    25%         4.2500          0         0.000000                 0
    50%         7.5000          0         0.000000                 0
    75%        10.7500          0         3.500000                 0
    max        14.0000      False        22.000000                 0
    
    [8 rows x 4 columns]
    

    Note that some columns are left out because there is no meaningful way to summarise them numerically, for instance columns containing string data.

    You can transpose the result if you prefer:

    In [47]:
    
    df.describe().transpose()
    
    Out[47]:
    
                     count      mean       std    min   25%  50%    75%    max
    shopper_num         14       7.5    4.1833      1  4.25  7.5  10.75     14
    is_martian          14         0         0  False     0    0      0  False
    number_of_items     14  3.357143  6.452276      0     0    0    3.5     22
    count_pineapples    14         0         0      0     0    0      0      0
    
    [4 rows x 8 columns]
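
The groupby route mentioned at the top of this answer can be sketched roughly as follows. The question's data is truncated, so the small frame below is made up for illustration; only the column names birth_country and number_of_items come from the question, and the grouping key and the chosen aggregation functions are assumptions.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "birth_country": ["US", "US", "CA", "CA", "US"],
    "number_of_items": [0, 1, 3, 22, 4],
})

# Group by one column and apply several aggregation functions at once.
# Each function name in the list becomes a column of the result.
summary = df.groupby("birth_country")["number_of_items"].agg(["count", "mean", "max"])
print(summary)
```

Passing a list to agg gives one row per group and one column per function, which is useful when describe() computes more (or less) than you need.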
    
  • 2020-12-07 20:58

    Now there is the pandas_profiling package, which is a more complete alternative to df.describe().

    If your pandas dataframe is df, the code below returns a complete analysis, including warnings about missing values, skewness, etc. It also presents histograms and correlation plots.

    import pandas_profiling
    pandas_profiling.ProfileReport(df)
    

    See the example notebook detailing the usage.

  • 2020-12-07 21:03

    To clarify one point in @EdChum's answer: per the documentation, you can include the object columns by using df.describe(include='all'). It won't provide many statistics for them, but it will provide a few pieces of information, including the count, the number of unique values, and the top value. This may be a new feature; I don't know, as I am a relatively new user.
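
A minimal sketch of the difference, assuming a small made-up frame (only the column names number_of_items and birth_country come from the question; the values are hypothetical):

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "number_of_items": [0, 1, 3, 22],
    "birth_country": ["US", "US", "CA", "US"],
})

# Default behaviour: only numeric columns are summarised.
numeric_only = df.describe()

# include='all' also covers the string column, adding rows such as
# unique, top and freq; statistics that do not apply to a column are NaN.
everything = df.describe(include="all")
print(everything)
```

In the combined result, the numeric rows (mean, std, quartiles) are NaN for birth_country, and the unique/top/freq rows are NaN for number_of_items.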
