ValueError: negative dimensions are not allowed in scikit linear regression CV model with sparse matrices

Anonymous (unverified), submitted on 2019-12-03 09:06:55

Question:

I recently competed in a Kaggle competition and ran into problems trying to run the linear CV models from scikit-learn. I am aware of a similar question on Stack Overflow, but I can't see how the accepted answer relates to my issue. Any assistance would be greatly appreciated. My code is given below:

    train=pd.read_csv(".../train.csv")
    test=pd.read_csv(".../test.csv")
    data=pd.read_csv(".../sampleSubmission.csv")

    from sklearn.feature_extraction.text import TfidfVectorizer
    transformer = TfidfVectorizer(max_features=None)
    Y=transformer.fit_transform(train.tweet)
    Z=transformer.transform(test.tweet)

    from sklearn import linear_model

    clf = linear_model.RidgeCV()

    a=4
    b=1
    while (a<28):
        clf.fit(Y, train.ix[:,a])
        pred=clf.predict(Z)
        linpred=pd.DataFrame(pred)
        data[data.columns[b]]=linpred
        b=b+1
        a=a+1
    print b

The error that I receive is pasted in full below:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-17-41c31233c15c> in <module>()
          1 blah=train.ix[:,a]
    ----> 2 clf.fit(Y, blah)

    D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
        815                                   gcv_mode=self.gcv_mode,
        816                                   store_cv_values=self.store_cv_values)
    --> 817             estimator.fit(X, y, sample_weight=sample_weight)
        818             self.alpha_ = estimator.alpha_
        819             if self.store_cv_values:

    D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
        722             raise ValueError('bad gcv_mode "%s"' % gcv_mode)
        723
    --> 724         v, Q, QT_y = _pre_compute(X, y)
        725         n_y = 1 if len(y.shape) == 1 else y.shape[1]
        726         cv_values = np.zeros((n_samples * n_y, len(self.alphas)))

    D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in _pre_compute(self, X, y)
        607     def _pre_compute(self, X, y):
        608         # even if X is very sparse, K is usually very dense
    --> 609         K = safe_sparse_dot(X, X.T, dense_output=True)
        610         v, Q = linalg.eigh(K)
        611         QT_y = np.dot(Q.T, y)

    D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\utils\extmath.pyc in safe_sparse_dot(a, b, dense_output)
         76     from scipy import sparse
         77     if sparse.issparse(a) or sparse.issparse(b):
    ---> 78         ret = a * b
         79         if dense_output and hasattr(ret, "toarray"):
         80             ret = ret.toarray()

    D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\base.pyc in __mul__(self, other)
        301             if self.shape[1] != other.shape[0]:
        302                 raise ValueError('dimension mismatch')
    --> 303             return self._mul_sparse_matrix(other)
        304
        305         try:

    D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\compressed.pyc in _mul_sparse_matrix(self, other)
        518
        519         nnz = indptr[-1]
    --> 520         indices = np.empty(nnz, dtype=np.intc)
        521         data = np.empty(nnz, dtype=upcast(self.dtype,other.dtype))
        522

    ValueError: negative dimensions are not allowed

Answer 1:

It looks like this problem occurs even without sklearn: it's in scipy.sparse matrix multiplication. There is a thread about this on the scipy-users board, titled "sparse matrix multiplication problem". The crux of the problem is that scipy uses a 32-bit int for the non-zero indices during sparse matrix multiplication (see the marked line at the bottom of the traceback above). That integer can overflow if the result has too many non-zero elements, which makes the variable nnz negative. The code at the last arrow then tries to create an empty array of size nnz, resulting in a ValueError due to a negative dimension.
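The wraparound described above can be sketched in isolation with plain NumPy. This is only an illustration of the mechanism, not scipy's actual code: the per-row counts in `row_nnz` and the helper `try_allocate` are made-up names and values chosen so that the running total exceeds the 32-bit limit.

```python
import numpy as np

# Hypothetical per-row non-zero counts (illustrative values, not from the question).
# Each value fits in a signed 32-bit int, but their sum (3e9) does not.
row_nnz = np.array([1_500_000_000, 1_500_000_000], dtype=np.int32)

# Running total kept in 32-bit ints, similar in spirit to a CSR indptr array.
# Integer overflow in NumPy array arithmetic wraps around silently.
indptr = np.cumsum(row_nnz, dtype=np.int32)
nnz = int(indptr[-1])  # wrapped around to a negative number


def try_allocate(n):
    """Try to allocate an index array of length n; return 'ok' or the error text."""
    try:
        np.empty(n, dtype=np.intc)
        return "ok"
    except ValueError as exc:
        return str(exc)


print(nnz)
print(try_allocate(nnz))
```

Once the total is negative, the allocation fails with the same kind of ValueError that ends the traceback in the question.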

You can reproduce the tail end of the traceback above without sklearn as follows:

    import scipy.sparse as ss

    X = ss.rand(75000, 42000, format='csr', density=0.01)
    X * X.T

For this problem, the input is probably quite sparse, but RidgeCV appears to multiply X by X.T in the last sklearn frame of the traceback (_pre_compute), and that product might not be sparse enough.
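A rough back-of-the-envelope estimate shows why the product can be far denser than the input. The sketch below uses the shape and density of the random matrix from the reproduction above and assumes the non-zeros are placed independently and uniformly at random, which will not hold exactly for real TF-IDF data. The variable names are illustrative.

```python
# Shape and density of the random matrix used in the reproduction above.
n_rows, n_cols, density = 75000, 42000, 0.01

# Entry (i, j) of X * X.T is non-zero when rows i and j share at least one
# non-zero column. Under the independence assumption, a given column is
# non-zero in both rows with probability density**2.
p_overlap = 1 - (1 - density ** 2) ** n_cols

# Expected number of non-zero entries in the n_rows x n_rows product.
expected_nnz = n_rows * n_rows * p_overlap

# Largest value a signed 32-bit integer can hold.
int32_max = 2 ** 31 - 1

print("fraction of non-zero entries in X * X.T: %.3f" % p_overlap)
print("expected non-zeros: %.3g (int32 max: %d)" % (expected_nnz, int32_max))
print("exceeds int32 range:", expected_nnz > int32_max)
```

Under these assumptions, a matrix that is only 1% dense yields a product that is almost completely dense, so the non-zero count is well beyond what a 32-bit index can represent.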


