memory error when todense in python using CountVectorizer


Question


Here is my code and the memory error I get when calling todense(). I am using a GBDT model, and I am wondering if anyone has a good idea for working around the memory error? Thanks.

  for feature_colunm_name in feature_columns_to_use:
    X_train[feature_colunm_name] = CountVectorizer().fit_transform(X_train[feature_colunm_name]).todense()
    X_test[feature_colunm_name] = CountVectorizer().fit_transform(X_test[feature_colunm_name]).todense()
  y_train = y_train.astype('int')
  grd = GradientBoostingClassifier(n_estimators=n_estimator, max_depth=10)
  grd.fit(X_train.values, y_train.values)

Detailed error message:

in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
...

regards, Lin


Answer 1:


There are multiple things wrong here:

for feature_colunm_name in feature_columns_to_use:
    X_train[feature_colunm_name] = CountVectorizer().fit_transform(X_train[feature_colunm_name]).todense()
    X_test[feature_colunm_name] = CountVectorizer().fit_transform(X_test[feature_colunm_name]).todense()

1) You are trying to assign multiple columns (the result of CountVectorizer is a 2-D sparse matrix whose columns represent features) to the single column 'feature_colunm_name' of the DataFrame. That is not going to work and will produce an error.

2) You are fitting the CountVectorizer again on the test data, which is wrong. You should use the same CountVectorizer object on the test data that you fitted on the training data, and only call transform(), not fit_transform().

Something like:

cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train[feature_colunm_name])
X_test_cv = cv.transform(X_test[feature_colunm_name])

3) GradientBoostingClassifier works well with sparse data. That is not mentioned in the documentation yet (which looks like a documentation oversight). See the short sketch after this list.

4) You seem to be transforming multiple columns of your original data into bag-of-words form. For that you will need a separate CountVectorizer object per column, and then merge all of the resulting matrices into a single sparse matrix which you pass to GradientBoostingClassifier.
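
To illustrate point 3, here is a minimal, self-contained sketch with made-up toy data (the texts, labels, and variable names are placeholders, not the asker's data) showing GradientBoostingClassifier being fit directly on the sparse matrix that CountVectorizer returns, with no todense() call anywhere:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier

# Toy data, just to show the sparse pipeline end to end
texts = ["red green", "green blue", "red red blue", "blue"]
labels = [0, 1, 0, 1]

cv = CountVectorizer()
X_sparse = cv.fit_transform(texts)        # scipy.sparse matrix, never densified

clf = GradientBoostingClassifier(n_estimators=10, max_depth=3)
clf.fit(X_sparse, labels)                 # accepts the sparse matrix directly
print(clf.predict(cv.transform(["red blue"])))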

Update:

You need to set up something like this:

# To merge sparse matrices
from scipy.sparse import hstack

result_matrix_train = None
result_matrix_test = None

for feature_colunm_name in feature_columns_to_use:
    cv = CountVectorizer()
    X_train_cv = cv.fit_transform(X_train[feature_colunm_name])

    # Merge the vector with the others (keep everything sparse)
    result_matrix_train = (hstack((result_matrix_train, X_train_cv))
                           if result_matrix_train is not None else X_train_cv)

    # Now transform the test data with the same fitted CountVectorizer
    X_test_cv = cv.transform(X_test[feature_colunm_name])
    result_matrix_test = (hstack((result_matrix_test, X_test_cv))
                          if result_matrix_test is not None else X_test_cv)

Note: If you also have other columns which you did not process through the CountVectorizer (because they are already numerical, for example) and which you want to merge into result_matrix_train, you can do that too:

result_matrix_train = hstack((result_matrix_train, X_train[other_columns].values))
result_matrix_test = hstack((result_matrix_test, X_test[other_columns].values))
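
One optional detail: scipy.sparse.hstack returns a COO matrix by default, so you may want to convert the merged results to CSR (GradientBoostingClassifier converts sparse input to CSR internally anyway) and check that train and test ended up with the same number of feature columns:

# hstack returns COO by default; CSR is convenient for downstream estimators
result_matrix_train = result_matrix_train.tocsr()
result_matrix_test = result_matrix_test.tocsr()

# Sanity check: both matrices must have the same number of feature columns
assert result_matrix_train.shape[1] == result_matrix_test.shape[1]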

Now use these to train:

...
grd.fit(result_matrix_train, y_train.values)
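
For completeness, a short usage sketch: assuming grd is the same GradientBoostingClassifier set up in the question, prediction uses the test matrix that was built with the same fitted CountVectorizer objects:

# Predict on the matrix built with the same fitted vectorizers as the training matrix
y_pred = grd.predict(result_matrix_test)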


Source: https://stackoverflow.com/questions/52194375/memory-error-when-todense-in-python-using-countvectorizer
