How to do discretization of continuous attributes in sklearn?

萌比男神i 2021-01-02 08:41

My data consists of a mix of continuous and categorical features. Below is a small snippet of what my data looks like in CSV format (consider it as data collected by a su

5 answers
  •  春和景丽
    2021-01-02 09:16

    Building on the ideas above:

    To discretize continuous values, you may use:

    1. the pandas cut or qcut functions (the input array must be 1-dimensional),

    or

    2. scikit-learn's KBinsDiscretizer (with the encode parameter set to 'ordinal'):

      • strategy='uniform' discretizes in the same manner as pd.cut
      • strategy='quantile' discretizes in the same manner as pd.qcut
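    A quick check of the equivalence claims above (a minimal sketch with arbitrary toy numbers; note that pd.cut widens the leftmost edge slightly, so labels can differ for values sitting exactly on a bin boundary):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([24, 35, 74, 96, 2, 39], dtype=float)

# Equal-width bins: pd.cut vs. strategy='uniform'
cut_labels = pd.cut(x, bins=3, labels=False)
uni = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
uni_labels = uni.fit_transform(x.reshape(-1, 1)).ravel().astype(int)
print(cut_labels)   # [0 1 2 2 0 1]
print(uni_labels)   # [0 1 2 2 0 1]

# Quantile-based bins: pd.qcut vs. strategy='quantile'
qcut_labels = pd.qcut(x, q=3, labels=False)
qua = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
qua_labels = qua.fit_transform(x.reshape(-1, 1)).ravel().astype(int)
print(qcut_labels)  # [0 1 2 2 0 1]
print(qua_labels)   # [0 1 2 2 0 1]
```

    Both pairs agree on this data: each value lands in the same bin whichever API you use.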

    Since examples for cut/qcut are provided in previous answers, here is a clean example using KBinsDiscretizer:

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer
    
    A = np.array([[24, 0.2], [35, 0.3], [74, 0.4], [96, 0.5], [2, 0.6], [39, 0.8]])
    print(A)
    # [[24.   0.2]
    #  [35.   0.3]
    #  [74.   0.4]
    #  [96.   0.5]
    #  [ 2.   0.6]
    #  [39.   0.8]]
    
    
    # 3 equal-width bins per feature; 'ordinal' encodes each value as its bin index
    enc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
    enc.fit(A)
    print(enc.transform(A))
    # [[0. 0.]
    #  [1. 0.]
    #  [2. 1.]
    #  [2. 1.]
    #  [0. 2.]
    #  [1. 2.]]
    

    As shown in the output, each feature has been discretized into 3 bins. Hope this helped :)


    Final notes:

    • To compare cut versus qcut, see this post
    • For your categorical features, use scikit-learn's OneHotEncoder; note that KBinsDiscretizer(encode='onehot') instead one-hot encodes the bins it produces for continuous features
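    To illustrate what the one-hot encode option does (a minimal sketch reusing the toy array from the example above; 'onehot-dense' is used instead of 'onehot' only so the result prints as a plain array rather than a sparse matrix):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

A = np.array([[24, 0.2], [35, 0.3], [74, 0.4],
              [96, 0.5], [2, 0.6], [39, 0.8]])

# Each continuous feature is split into 3 bins, and each bin
# becomes its own 0/1 column: 2 features * 3 bins = 6 columns.
enc = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')
X = enc.fit_transform(A)
print(X.shape)  # (6, 6)
print(X[0])     # [1. 0. 0. 1. 0. 0.]  ->  both features fall in bin 0
```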
