How to do discretization of continuous attributes in sklearn?

萌比男神i 2021-01-02 08:41

My data consists of a mix of continuous and categorical features. Below is a small snippet of what my data looks like in CSV format (Consider it as data collected by a su

5 answers

醉酒成梦 (original poster)

2021-01-02 09:11
Update (Sep 2018): As of version 0.20.0, scikit-learn provides a class, sklearn.preprocessing.KBinsDiscretizer, which discretizes continuous features using one of a few strategies:
- Equal-width bins (strategy='uniform')
- Bins with "equal" numbers of samples inside, as far as possible (strategy='quantile')
- Bins based on K-means clustering (strategy='kmeans')
Unfortunately, at the moment, the class does not accept custom intervals (which is a bummer for me, as that is what I wanted and the reason I ended up here). If you want custom intervals, you can use the pandas function cut:
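As a rough illustration of the three strategies listed above, here is a minimal sketch. The toy array, the choice of n_bins=3 and the ordinal encoding are assumptions made for the example, not part of the original question:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy data (assumed for illustration): one continuous feature,
# shaped (n_samples, n_features) as scikit-learn expects.
X = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 9.0, 10.0]).reshape(-1, 1)

results = {}
for strategy in ("uniform", "quantile", "kmeans"):
    # encode="ordinal" returns the bin index per sample instead of one-hot columns
    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    results[strategy] = est.fit_transform(X).ravel()
    # bin_edges_ holds the learned edges, one array per feature
    print(strategy, est.bin_edges_[0], results[strategy])
```

The learned edges are available in bin_edges_ after fitting, which makes it easy to check where each strategy places its cut points. Recent scikit-learn versions may print a FutureWarning about changing defaults here.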
```
import numpy as np
import pandas as pd
n_samples = 10
a = np.random.randint(0, 10, n_samples)

# say you want to split at 1 and 3
boundaries = [1, 3]
# add min and max values of your data
boundaries = sorted({a.min(), a.max() + 1} | set(boundaries))

a_discretized_1 = pd.cut(a, bins=boundaries, right=False)
a_discretized_2 = pd.cut(a, bins=boundaries, labels=range(len(boundaries) - 1), right=False)
a_discretized_3 = pd.cut(a, bins=boundaries, labels=range(len(boundaries) - 1), right=False).astype(float)
print(a, '\n')
print(a_discretized_1, '\n', a_discretized_1.dtype, '\n')
print(a_discretized_2, '\n', a_discretized_2.dtype, '\n')
print(a_discretized_3, '\n', a_discretized_3.dtype, '\n')
```
which produces (for one particular random draw):
```
[2 2 9 7 2 9 3 0 4 0]

[[1, 3), [1, 3), [3, 10), [3, 10), [1, 3), [3, 10), [3, 10), [0, 1), [3, 10), [0, 1)]
Categories (3, interval[int64]): [[0, 1) < [1, 3) < [3, 10)]
 category

[1, 1, 2, 2, 1, 2, 2, 0, 2, 0]
Categories (3, int64): [0 < 1 < 2]
 category

[1. 1. 2. 2. 1. 2. 2. 0. 2. 0.]
 float64
```
Note that, by default, pd.cut returns a categorical object (a pd.Categorical for array input, as here, or a pd.Series of dtype category when the input is a Series) whose categories are of type interval[int64]. If you specify your own labels, the output is still categorical, but the categories are of type int64. If you want a plain numeric dtype instead, convert the result with .astype(float), as in the third example, or with .astype(np.int64).

My example uses integer data, but it should work just as well with floats.
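A quick sketch of the float case, using the same boundary construction as the example above. The sample values are made up for illustration, and labels=False is used here as a shortcut that makes pd.cut return plain integer bin codes rather than a categorical:

```python
import numpy as np
import pandas as pd

# Made-up float data (assumption for illustration)
x = np.array([0.2, 1.0, 1.7, 2.99, 3.0, 7.5])

# Split at 1 and 3, then add the data's min and (max + 1) as outer edges
boundaries = [1, 3]
edges = sorted({x.min(), x.max() + 1} | set(boundaries))

# right=False makes the bins left-closed: [0.2, 1), [1, 3), [3, 8.5)
codes = pd.cut(x, bins=edges, labels=False, right=False)
print(codes)
```

Because the bins are left-closed, a value that sits exactly on a boundary (1.0 or 3.0 here) falls into the bin that starts at that boundary.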