Frequency table for a single variable

温柔的废话 2020-11-29 19:54

One last newbie pandas question for the day: How do I generate a frequency table for a single Series?

For example:

my_series = pandas.Series([1,2,2,3,3,3])
pandas.magical_frequency_function(my_series)  # hypothetical; want something like {1: 1, 2: 2, 3: 3}
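
For reference, the direct pandas answer to this is Series.value_counts(), which counts how often each distinct value occurs; a minimal sketch:

    import pandas as pd

    my_series = pd.Series([1, 2, 2, 3, 3, 3])

    # counts per distinct value: 3 appears three times, 2 twice, 1 once
    print(my_series.value_counts())

    # relative frequencies instead of raw counts
    print(my_series.value_counts(normalize=True))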


        
4 Answers
  •  时光取名叫无心
    2020-11-29 20:06

    For the frequency distribution of a variable with many distinct values, you can collapse the values into classes.

    Here the employrate variable takes many distinct values, so a frequency distribution from a direct value_counts(normalize=True) is not very meaningful. The data look like this:

                    country  employrate alcconsumption
    0           Afghanistan   55.700001            .03
    1               Albania   11.000000           7.29
    2               Algeria   11.000000            .69
    3               Andorra         nan          10.17
    4                Angola   75.699997           5.57
    ..                  ...         ...            ...
    208             Vietnam   71.000000           3.91
    209  West Bank and Gaza   32.000000               
    210         Yemen, Rep.   39.000000             .2
    211              Zambia   61.000000           3.56
    212            Zimbabwe   66.800003           4.96
    
    [213 rows x 3 columns]
    

    The frequency distribution from value_counts(sort=False, normalize=True) with no classification has length 139, which seems meaningless as a frequency distribution:

    print(gm["employrate"].value_counts(sort=False,normalize=True))
    
    50.500000   0.005618
    61.500000   0.016854
    46.000000   0.011236
    64.500000   0.005618
    63.500000   0.005618
    
    58.599998   0.005618
    63.799999   0.011236
    63.200001   0.005618
    65.599998   0.005618
    68.300003   0.005618
    Name: employrate, Length: 139, dtype: float64
    

    With classification we put all values within a certain range into a single class, i.e.

    values up to 20 as class 1,
    21-30 as class 2,
    31-40 as class 3, and so forth:
    gm["employrate"]=gm["employrate"].str.strip().dropna()  
    gm["employrate"]=pd.to_numeric(gm["employrate"])
    gm['employrate'] = np.where(
       (gm['employrate'] <=10) & (gm['employrate'] > 0) , 1, gm['employrate']
       )
    gm['employrate'] = np.where(
       (gm['employrate'] <=20) & (gm['employrate'] > 10) , 1, gm['employrate']
       )
    gm['employrate'] = np.where(
       (gm['employrate'] <=30) & (gm['employrate'] > 20) , 2, gm['employrate']
       )
    gm['employrate'] = np.where(
       (gm['employrate'] <=40) & (gm['employrate'] > 30) , 3, gm['employrate']
       )
    gm['employrate'] = np.where(
       (gm['employrate'] <=50) & (gm['employrate'] > 40) , 4, gm['employrate']
       )
    gm['employrate'] = np.where(
       (gm['employrate'] <=60) & (gm['employrate'] > 50) , 5, gm['employrate']
       )
    gm['employrate'] = np.where(
       (gm['employrate'] <=70) & (gm['employrate'] > 60) , 6, gm['employrate']
       )
    gm['employrate'] = np.where(
       (gm['employrate'] <=80) & (gm['employrate'] > 70) , 7, gm['employrate']
       )
    gm['employrate'] = np.where(
       (gm['employrate'] <=90) & (gm['employrate'] > 80) , 8, gm['employrate']
       )
    gm['employrate'] = np.where(
       (gm['employrate'] <=100) & (gm['employrate'] > 90) , 9, gm['employrate']
       )
    print(gm["employrate"].value_counts(sort=False,normalize=True))
    

    After classification we have a clear frequency distribution. Here we can easily see that 37.64% of countries have an employment rate between 51-60% and 11.79% of countries have an employment rate between 71-80%:

    5.000000   0.376404
    7.000000   0.117978
    4.000000   0.179775
    6.000000   0.264045
    8.000000   0.033708
    3.000000   0.028090
    Name: employrate, dtype: float64
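
    The same kind of binned distribution can also be produced more compactly with pandas.cut (just a sketch, not part of the original answer), run on the cleaned numeric employrate column before the np.where reclassification above:

    import pandas as pd

    # bin the numeric employrate into 10-point intervals (0, 10], (10, 20], ..., (90, 100]
    classes = pd.cut(gm["employrate"], bins=range(0, 101, 10))

    # normalized frequency of each interval, in interval order
    print(classes.value_counts(sort=False, normalize=True))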
    
