Plotting CDF of a pandas series in python

后端 未结 7 1152
情话喂你
情话喂你 2020-12-23 09:25

Is there a way to do this? I cannot seem an easy way to interface pandas series with plotting a CDF.

7条回答
  •  悲哀的现实
    2020-12-23 09:34

    In case you are also interested in the values, not just the plot.

    import pandas as pd
    
    # If you are in jupyter
    %matplotlib inline
    

    This will always work (discrete and continuous distributions)

    # Define your series
    s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
    df = pd.DataFrame(s)
    
    # Get the frequency, PDF and CDF for each value in the series
    
    # Frequency
    stats_df = df \
    .groupby('value') \
    ['value'] \
    .agg('count') \
    .pipe(pd.DataFrame) \
    .rename(columns = {'value': 'frequency'})
    
    # PDF
    stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
    
    # CDF
    stats_df['cdf'] = stats_df['pdf'].cumsum()
    stats_df = stats_df.reset_index()
    stats_df
    

    # Plot the discrete Probability Mass Function and CDF.
    # Technically, the 'pdf label in the legend and the table the should be 'pmf'
    # (Probability Mass Function) since the distribution is discrete.
    
    # If you don't have too many values / usually discrete case
    stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
    

    Alternative example with a sample drawn from a continuous distribution or you have a lot of individual values:

    # Define your series
    s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
    
    # ... all the same calculation stuff to get the frequency, PDF, CDF
    
    # Plot
    stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
    

    For continuous distributions only

    Please note if it very reasonable to make the assumption that there is only one occurence of each value in the sample (typically encountered in the case of continuous distributions) then the groupby() + agg('count') is not necessary (since the count is always 1).

    In this case, a percent rank can be used to get to the cdf directly.

    Use your best judgment when taking this kind of shortcut! :)

    # Define your series
    s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
    df = pd.DataFrame(s)
    
    # Get to the CDF directly
    df['cdf'] = df.rank(method = 'average', pct = True)
    
    # Sort and plot
    df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
    

提交回复
热议问题