How to do discretization of continuous attributes in sklearn?

萌比男神i 2021-01-02 08:41

My data consists of a mix of continuous and categorical features. Below is a small snippet of what my data looks like in CSV format (consider it as data collected by a su

5 answers
  •  春和景丽
    2021-01-02 09:16

    Building on the ideas above:

    To discretize continuous values, you may use:

    1. the pandas cut or qcut functions (the input array must be 1-dimensional),

    or

    2. scikit-learn's KBinsDiscretizer (with the encode parameter set to 'ordinal'):

      • strategy='uniform' discretizes in the same manner as pd.cut
      • strategy='quantile' discretizes in the same manner as pd.qcut
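    A quick check of the equivalence claims above (a minimal sketch with arbitrary toy numbers; note that pd.cut widens the leftmost edge slightly, so labels can differ for values sitting exactly on a bin boundary):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([24, 35, 74, 96, 2, 39], dtype=float)

# Equal-width bins: pd.cut vs. strategy='uniform'
cut_labels = pd.cut(x, bins=3, labels=False)
uni = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
uni_labels = uni.fit_transform(x.reshape(-1, 1)).ravel().astype(int)
print(cut_labels)   # [0 1 2 2 0 1]
print(uni_labels)   # [0 1 2 2 0 1]

# Quantile-based bins: pd.qcut vs. strategy='quantile'
qcut_labels = pd.qcut(x, q=3, labels=False)
qua = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
qua_labels = qua.fit_transform(x.reshape(-1, 1)).ravel().astype(int)
print(qcut_labels)  # [0 1 2 2 0 1]
print(qua_labels)   # [0 1 2 2 0 1]
```

    Both pairs agree on this data: each value lands in the same bin whichever API you use.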

    Since examples for cut/qcut are provided in previous answers, here is a clean example using KBinsDiscretizer:

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer
    
    A = np.array([[24, 0.2], [35, 0.3], [74, 0.4], [96, 0.5], [2, 0.6], [39, 0.8]])
    print(A)
    # [[24.   0.2]
    #  [35.   0.3]
    #  [74.   0.4]
    #  [96.   0.5]
    #  [ 2.   0.6]
    #  [39.   0.8]]
    
    
    # 3 equal-width bins per feature; 'ordinal' encodes each value as its bin index
    enc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
    enc.fit(A)
    print(enc.transform(A))
    # [[0. 0.]
    #  [1. 0.]
    #  [2. 1.]
    #  [2. 1.]
    #  [0. 2.]
    #  [1. 2.]]
    

    As shown in the output, each feature has been discretized into 3 bins. Hope this helped :)


    Final notes:

    • To compare cut versus qcut, see this post
    • For your categorical features, use scikit-learn's OneHotEncoder; note that KBinsDiscretizer(encode='onehot') instead one-hot encodes the bins it produces for continuous features
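    To illustrate what the one-hot encode option does (a minimal sketch reusing the toy array from the example above; 'onehot-dense' is used instead of 'onehot' only so the result prints as a plain array rather than a sparse matrix):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

A = np.array([[24, 0.2], [35, 0.3], [74, 0.4],
              [96, 0.5], [2, 0.6], [39, 0.8]])

# Each continuous feature is split into 3 bins, and each bin
# becomes its own 0/1 column: 2 features * 3 bins = 6 columns.
enc = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')
X = enc.fit_transform(A)
print(X.shape)  # (6, 6)
print(X[0])     # [1. 0. 0. 1. 0. 0.]  ->  both features fall in bin 0
```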
