Text classification with scikit-learn and a large dataset


Question


First of all, I started with Python yesterday. I'm trying to do text classification with scikit-learn and a large dataset (250,000 tweets). For the algorithm, every tweet will be represented as a 4000 x 1 vector, so the input is 250,000 rows by 4000 columns. When I try to construct this in Python, I run out of memory after 8500 tweets (when working with a list and appending to it), and when I preallocate the memory I just get a MemoryError (from np.zeros((250000, 4000))). Is scikit-learn not able to work with datasets this large? Am I doing something wrong (it is only my second day with Python)? Is there another way of representing the features so that they can fit in my memory?

Edit: I want to use the Bernoulli NB (Naive Bayes) classifier.

Edit 2: Maybe it is possible with online learning? Read a tweet, let the model learn from it, remove it from memory, read another, let the model learn again... but I don't think Bernoulli NB allows for online learning in scikit-learn.
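
(A minimal sketch of that streaming idea, assuming a recent scikit-learn in which BernoulliNB exposes partial_fit; read_tweet_batches and the two class labels are placeholders, not anything from the original post:)

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB

# HashingVectorizer is stateless, so it can vectorize one batch at a time
# without ever holding the full corpus in memory
vectorizer = HashingVectorizer(n_features=4000, binary=True, alternate_sign=False)
clf = BernoulliNB()

for texts, labels in read_tweet_batches():  # hypothetical generator yielding (tweets, labels)
    X = vectorizer.transform(texts)         # sparse batch, freed after each iteration
    clf.partial_fit(X, labels, classes=[0, 1])  # classes=[0, 1] assumes a binary labelling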


Answer 1:


I assume that these 4000 x 1 vectors are bag-of-words representations. If that is the case, then that 250,000 by 4000 matrix contains mostly zeros, because each tweet uses only a handful of words. Matrices like this are called sparse matrices, and there are efficient ways of storing them in memory. See the SciPy documentation and the scikit-learn documentation on sparse matrices to get started; if you need more help after reading those links, post again.
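
As a rough illustration of the difference (the shape below is just the 250,000 x 4000 from the question, not real data): a dense float64 array of that shape needs roughly 8 GB, while a SciPy sparse matrix only stores the non-zero entries.

import numpy as np
from scipy import sparse

rows, cols = 250_000, 4_000            # shape from the question
X = sparse.lil_matrix((rows, cols), dtype=np.float64)  # stores only non-zero entries
X[0, 42] = 1.0                         # e.g. "word 42 occurs in tweet 0"
X = X.tocsr()                          # CSR format is what scikit-learn estimators expect
print(X.nnz, X.shape)                  # 1 stored value, logical shape (250000, 4000)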




Answer 2:


If you use scikit-learn's vectorizers (CountVectorizer or TfidfVectorizer are good as a first attempt), you get a sparse matrix representation. From the documentation:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)  # returns a scipy sparse matrix
# initialize your classifier here, e.g. clf = BernoulliNB()
clf.fit(X_train, y_train)
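
Putting the two pieces together for the Bernoulli NB case from the question, a minimal end-to-end sketch could look like this (the tweets and labels below are made-up placeholders):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

tweets = ["good morning world", "awful traffic today", "lovely weather"]  # placeholder data
labels = [1, 0, 1]                                                        # placeholder labels

# binary=True yields 0/1 presence features, which matches BernoulliNB's assumptions
vectorizer = CountVectorizer(binary=True, max_features=4000)
X = vectorizer.fit_transform(tweets)   # sparse matrix, never a dense 250,000 x 4000 array

clf = BernoulliNB()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["what a lovely morning"])))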


Source: https://stackoverflow.com/questions/13741460/text-classification-with-scikit-learn-and-a-large-dataset
