text classification with SciKit-learn and a large dataset

First of all I started with python yesterday. I'm trying to do text classification with SciKit and a large dataset (250.000 tweets). For the algorithm, every tweet will be represented as a 4000 x 1 vector, so this means the input is 250.000 rows and 4000 columns. When i try to construct this in python, I run out of memory after 8500 tweets (when working with a list and appending it) and when I preallocate the memory I just get the error: MemoryError (np.zeros(4000,2500000)). Is SciKit not able to work with these large datasets \? Am I doing something wrong (as it is my second day with python)? Is there another way of representing the features so that it can fit in my memory ?

edit: I want the to the Bernoulli NB

edit2: Maybe it is possible with online learning ? read a tweet, let the model use the tweet, remove it from memory , read another, let the model learn... but I don't think Bernoulli NB allows for online learning in scikit-learn

I assume that these 4000 x 1 vectors are bag-of-words representations. If that is the case, then that 250000 by 4000 matrix has a lot of zeros, because in each tweet there are only very few words. This kind of matrices are called sparse matrices, and there are efficient ways of storing them in memory. See the Scipy documentation and the SciKit documentation for sparse matrices to get started; if you need more help after reading those links, post again.

If you use scikits' vectorizers (CountVectorizer or TfidfVectorizer are good as a first attempt) you get a sparse matrix representation. From the documentation:

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
#initialize your classifier
clf.fit(X_train, y_train)

来源：https://stackoverflow.com/questions/13741460/text-classification-with-scikit-learn-and-a-large-dataset

标签

python

nlp

scikit-learn

scikits