How to do discretization of continuous attributes in sklearn?

萌比男神i 2021-01-02 08:41

My data consists of a mix of continuous and categorical features. Below is a small snippet of what my data looks like in CSV format (Consider it as data collected by a su

5 answers
  •  醉酒成梦
    2021-01-02 09:11

    Update (Sep 2018): As of version 0.20.0, scikit-learn provides a class, sklearn.preprocessing.KBinsDiscretizer, which discretizes continuous features using one of a few different strategies:

    • Uniformly-sized bins
    • Bins with "equal" numbers of samples inside (as much as possible)
    • Bins based on K-means clustering
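As a rough sketch of the three strategies listed above, the snippet below fits KBinsDiscretizer once per strategy. The toy array X, n_bins=3, and encode="ordinal" are assumptions chosen for illustration and are not part of the original answer.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy data (assumed for illustration): one continuous feature with values 0..9.
# KBinsDiscretizer expects a 2-D array of shape (n_samples, n_features).
X = np.arange(10, dtype=float).reshape(-1, 1)

results = {}
for strategy in ("uniform", "quantile", "kmeans"):
    # encode="ordinal" returns the bin index per sample instead of one-hot columns.
    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    results[strategy] = est.fit_transform(X).ravel()
    # bin_edges_ holds one array of learned edges per feature.
    print(strategy, est.bin_edges_[0], results[strategy])
```

The bin edges are learned from the data in fit, so the same fitted estimator can be reused with transform on new samples. Depending on the installed scikit-learn version, the "quantile" strategy may emit a FutureWarning about changing defaults.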

    Unfortunately, at the moment, KBinsDiscretizer does not accept custom intervals (which is a bummer for me, as that is what I wanted and the reason I ended up here). If you want to achieve the same, you can use the pandas function cut:

    import numpy as np
    import pandas as pd
    n_samples = 10
    a = np.random.randint(0, 10, n_samples)
    
    # say you want to split at 1 and 3
    boundaries = [1, 3]
    # add min and max values of your data
    boundaries = sorted({a.min(), a.max() + 1} | set(boundaries))
    
    a_discretized_1 = pd.cut(a, bins=boundaries, right=False)
    a_discretized_2 = pd.cut(a, bins=boundaries, labels=range(len(boundaries) - 1), right=False)
    a_discretized_3 = pd.cut(a, bins=boundaries, labels=range(len(boundaries) - 1), right=False).astype(float)
    print(a, '\n')
    print(a_discretized_1, '\n', a_discretized_1.dtype, '\n')
    print(a_discretized_2, '\n', a_discretized_2.dtype, '\n')
    print(a_discretized_3, '\n', a_discretized_3.dtype, '\n')
    

    which produces (for one random draw of a):

    [2 2 9 7 2 9 3 0 4 0]
    
    [[1, 3), [1, 3), [3, 10), [3, 10), [1, 3), [3, 10), [3, 10), [0, 1), [3, 10), [0, 1)]
    Categories (3, interval[int64]): [[0, 1) < [1, 3) < [3, 10)]
     category
    
    [1, 1, 2, 2, 1, 2, 2, 0, 2, 0]
    Categories (3, int64): [0 < 1 < 2]
     category
    
    [1. 1. 2. 2. 1. 2. 2. 0. 2. 0.]
     float64
    

    Note that, when given a NumPy array as in this example, pd.cut returns a pd.Categorical (or a pd.Series of dtype category if the input is a Series) whose categories are of type interval[int64]. If you specify your own labels, the result is still categorical, but the categories will be of type int64. If you want a plain numeric dtype, you can convert with .astype(float), as in the third example, or with .astype(np.int64).
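To make the return types concrete, here is a small sketch that reuses the sample array printed above with the same boundaries. The variable names are hypothetical, and the labels=False and .codes routes are alternatives that the original answer does not use.

```python
import numpy as np
import pandas as pd

# The sample array from the output above, with the same bin boundaries.
values = np.array([2, 2, 9, 7, 2, 9, 3, 0, 4, 0])
bins = [0, 1, 3, 10]

# The container type of the result follows the input type.
from_array = pd.cut(values, bins=bins, right=False)
from_series = pd.cut(pd.Series(values), bins=bins, right=False)
print(type(from_array).__name__, type(from_series).__name__)

# Two ways to get plain integer bin indices without building labels by hand:
codes = from_array.codes                                      # codes of the Categorical
as_int = pd.cut(values, bins=bins, right=False, labels=False)  # integer array directly
print(codes, as_int)
```

Both routes give the same bin indices as passing labels=range(len(boundaries) - 1) and then converting to a numeric dtype.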

    My example uses integer data, but it should work just as well with floats.
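A minimal sketch of the float case, assuming the same split points 1 and 3. Using -inf and +inf as the outer edges (instead of the data's min and max) is a variation on the answer's approach, not what the answer does; it avoids NaN for values outside the range seen so far. np.digitize is shown as a NumPy-only alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical float data for illustration.
x = np.array([0.2, 1.0, 2.5, 3.0, 7.9])

# Custom split points at 1 and 3, with open-ended outer bins.
edges = [-np.inf, 1.0, 3.0, np.inf]

# right=False makes bins left-closed: [-inf, 1), [1, 3), [3, inf).
binned = pd.cut(x, bins=edges, right=False, labels=False)

# np.digitize takes only the inner split points and uses the same
# left-closed convention by default.
digitized = np.digitize(x, [1.0, 3.0])

print(binned)
print(digitized)
```

With left-closed bins, a value equal to a split point (1.0 or 3.0 here) falls into the bin that starts at that point.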
