I encoded my categorical data using sklearn.OneHotEncoder and fed it to a random forest classifier. Everything seems to work and I got my predicted output back.
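Roughly, a minimal sketch of my setup (toy data only; the real dataset is bigger, and string categories need scikit-learn >= 0.20):

>>> import numpy as np
>>> from sklearn.preprocessing import OneHotEncoder
>>> from sklearn.ensemble import RandomForestClassifier
>>> X = np.array([['red'], ['blue'], ['red'], ['green']])  # toy categorical feature
>>> labels = np.array([0, 1, 0, 1])
>>> enc = OneHotEncoder()
>>> X_enc = enc.fit_transform(X)  # sparse one-hot matrix, categories sorted
>>> X_enc.toarray()
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])
>>> clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_enc, labels)
>>> pred = clf.predict(X_enc)  # predictions come back as expected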
If the features are dense integers, like [1, 2, 4, 5, 6] with only a few values missing, we can map each value directly to its corresponding column position.
>>> import numpy as np
>>> from scipy import sparse
>>> def _sparse_binary(y):
...     # One-hot encode y as a scipy.sparse CSR matrix.
...     row = np.arange(len(y))   # one row per sample
...     col = y - y.min()         # shift so the minimum value maps to column 0
...     data = np.ones(len(y))
...     return sparse.csr_matrix((data, (row, col)))
...
>>> y = np.random.randint(-2, 2, 8).reshape([4, 2])
>>> y
array([[ 0, -2],
       [-2,  1],
       [ 1,  0],
       [ 0, -2]])
>>> yc = [_sparse_binary(y[:, i]) for i in range(2)]
>>> for i in yc: print(i.todense())
...
[[ 0.  0.  1.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]]
[[ 1.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 1.  0.  0.  0.]]
>>> [i.shape for i in yc]
[(4, 4), (4, 4)]
This is a simplistic compromise of a method, but it works and is easy to reverse with argmax(), e.g.:
>>> np.argmax(yc[0].todense(), 1) + y.min(0)[0]
matrix([[ 0],
        [-2],
        [ 1],
        [ 0]])
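As a sanity check (a sketch on the same toy y, applying the same argmax reversal to every column), the round trip recovers the original array:

>>> y_back = np.stack(
...     [np.asarray(np.argmax(c.todense(), 1)).ravel() + y[:, i].min()
...      for i, c in enumerate(yc)],
...     axis=1)  # undo the per-column shift, then reassemble
>>> np.array_equal(y_back, y)
True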