My data consists of a mix of continuous and categorical features. Below is a small snippet of how my data looks like in the csv format (Consider it as data collected by a su
Update (Sep 2018): As of version 0.20.0, there is a function, sklearn.preprocessing.KBinsDiscretizer, which provides discretization of continuous features using a few different strategies:
Unfortunately, at the moment, the function does not accept custom intervals (which is a bummer for me as that is what I wanted and the reason I ended up here). If you want to achieve the same, you can use Pandas function cut:
import numpy as np
import pandas as pd
n_samples = 10
a = np.random.randint(0, 10, n_samples)
# say you want to split at 1 and 3
boundaries = [1, 3]
# add min and max values of your data
boundaries = sorted({a.min(), a.max() + 1} | set(boundaries))
a_discretized_1 = pd.cut(a, bins=boundaries, right=False)
a_discretized_2 = pd.cut(a, bins=boundaries, labels=range(len(boundaries) - 1), right=False)
a_discretized_3 = pd.cut(a, bins=boundaries, labels=range(len(boundaries) - 1), right=False).astype(float)
print(a, '\n')
print(a_discretized_1, '\n', a_discretized_1.dtype, '\n')
print(a_discretized_2, '\n', a_discretized_2.dtype, '\n')
print(a_discretized_3, '\n', a_discretized_3.dtype, '\n')
which produces:
[2 2 9 7 2 9 3 0 4 0]
[[1, 3), [1, 3), [3, 10), [3, 10), [1, 3), [3, 10), [3, 10), [0, 1), [3, 10), [0, 1)]
Categories (3, interval[int64]): [[0, 1) < [1, 3) < [3, 10)]
category
[1, 1, 2, 2, 1, 2, 2, 0, 2, 0]
Categories (3, int64): [0 < 1 < 2]
category
[1. 1. 2. 2. 1. 2. 2. 0. 2. 0.]
float64
Note that, by default, pd.cut returns a pd.Series object of dtype Category with elements of type interval[int64]. If you specify your own labels, the dtype of the output will still be a Category, but the elements will be of type int64. If you want the series to have a numeric dtype, you can use .astype(np.int64).
My example uses integer data, but it should work just as fine with floats.