MemoryError in toarray when using DictVectorizer of Scikit Learn

后端未结

关注

 7  2091

無奈伤痛 2021-01-06 05:47

I am trying to implement the SelectKBest algorithm on my data to get the best features out of it. For this I am first preprocessing my data using DictVectorizer and the data

7条回答

梦毁少年i (楼主)

2021-01-06 06:27

I was using DictVectorizer to transform categorical database entries into one hot vectors and was continually getting this memory error. I was making the following fatal flaw: d = DictVectorizer(sparse=False). When I would call d.transform() on some of the fields with 2000 or more categories, python would crash. The solution that worked was to instantiate DictVectorizer with sparse being True, which by the way is default behavior. If you are doing one hot representations of items with many categories, dense arrays are not the most efficient structure to use. Calling .toArray() in this case is very inefficient.

The purpose of the one hot vector in matrix multiplication is to select a row or column from some matrix. This can be done more efficiently simply by using the indices where a 1 exists in the vector. This is an implicit form of multiplication, that requires orders of magnitude less operations than is required of the explicit multiplication.

0 讨论(0)

查看其它7个回答
发布评论:

提交评论
- 加载中...