text classification with SciKit-learn and a large dataset


I assume that these 4000 x 1 vectors are bag-of-words representations. If that is the case, then the 250000 by 4000 matrix contains mostly zeros, because each tweet uses only a handful of the 4000 words. Matrices like this are called sparse matrices, and there are efficient ways of storing them in memory. See the SciPy documentation and the scikit-learn documentation for sparse matrices to get started; if you need more help after reading those links, post again.
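To make the idea concrete, here is a minimal sketch of storing a small bag-of-words matrix sparsely with SciPy (the matrix and its numbers are made up purely for illustration):

import numpy as np
from scipy.sparse import csr_matrix

# A dense 4 x 6 bag-of-words matrix: most entries are zero because
# each "tweet" contains only a few of the 6 vocabulary words.
dense = np.array([[0, 2, 0, 0, 1, 0],
                  [1, 0, 0, 0, 0, 0],
                  [0, 0, 0, 3, 0, 1],
                  [0, 1, 0, 0, 0, 0]])

sparse = csr_matrix(dense)           # store only the non-zero entries
print(sparse.nnz)                    # 7 stored values instead of 24 cells
print(sparse.data, sparse.indices)   # the values and their column indices

Scaled up to 250000 x 4000, this is the difference between a matrix that fits comfortably in memory and one that does not.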

If you use scikit-learn's vectorizers (CountVectorizer or TfidfVectorizer are good first attempts), you get a sparse matrix representation directly. From the documentation:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)  # returns a sparse matrix
clf = LinearSVC()  # initialize your classifier
clf.fit(X_train, y_train)
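To classify new tweets, reuse the same fitted vectorizer with transform (not fit_transform) so the feature columns line up with the training vocabulary. A short sketch, assuming data_test holds the raw test texts:

X_test = vectorizer.transform(data_test.data)  # same vocabulary as the training set
predicted = clf.predict(X_test)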