Scikit NaN or infinity error message

Posted by 你。 on 2019-12-21 06:20:32

Question


I'm importing some data from a csv file. The file has nan values flagged with text 'NA'. I import the data with:

X = genfromtxt(data, delimiter=',', dtype=float, skip_header=1)

I then use this code to replace the NaNs with previously calculated column means.

inds = np.where(np.isnan(X))
X[inds]=np.take(col_mean,inds[1])
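
For reference, here is a self-contained sketch of that replacement step on a small made-up array. The question does not show how col_mean was computed, so the np.nanmean call is an assumption.

```python
import numpy as np

# Made-up data standing in for the array loaded from the CSV file.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

# Assumption: the column means are computed while ignoring NaNs.
col_mean = np.nanmean(X, axis=0)

# np.where returns a (row_indices, col_indices) tuple; inds[1] picks,
# for each NaN, the mean of the column that NaN sits in.
inds = np.where(np.isnan(X))
X[inds] = np.take(col_mean, inds[1])

print(X)
```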

I then run a couple of checks and get empty arrays:

np.where(np.isnan(X))
np.where(np.isinf(X))

Finally I run a scikit classifier:

RF = ensemble.RandomForestClassifier(n_estimators=100,n_jobs=-1,verbose=2)
RF.fit(X, y)

and get the following error:

  File "C:\Users\m&g\Anaconda\lib\site-packages\sklearn\ensemble\forest.py", line 257, in fit
    check_ccontiguous=True)
  File "C:\Users\m&g\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 233, in check_arrays
    _assert_all_finite(array)
  File "C:\Users\m&g\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 27, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

Any ideas why it is telling me that there are NaN or infinity? I read this post and tried to run:

RF.fit(X.astype(float), y.astype(float))

but I get the same error.


Answer 1:


scikit-learn's decision trees cast their input to float32 for efficiency, but your values won't fit in that type:

>>> np.float32(8.9932064170227995e+41)
inf
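
This also explains why the isnan/isinf checks came back empty: they ran on the float64 array, where the values are still finite. A rough way to spot the problem, sketched here on made-up data, is to repeat the check after casting to float32 and to compare against np.finfo(np.float32).max:

```python
import numpy as np

# Made-up data: one entry is larger than float32 can hold (about 3.4e38).
X = np.array([[1.0, 2.0],
              [8.9932064170227995e+41, 4.0]])

# As float64, everything is finite, so isnan/isinf find nothing.
finite_as_float64 = bool(np.isfinite(X).all())

# After the cast, the oversized entry overflows to inf.
with np.errstate(over="ignore"):  # silence the overflow warning from the cast
    X32 = X.astype(np.float32)
finite_as_float32 = bool(np.isfinite(X32).all())

# Locate the entries that do not fit into float32.
too_big = np.abs(X) > np.finfo(np.float32).max
bad_rows, bad_cols = np.where(too_big)

print(finite_as_float64, finite_as_float32, bad_rows, bad_cols)
```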

The solution is to standardize prior to fitting a model with sklearn.preprocessing.StandardScaler. Don't forget to transform prior to predicting. You can use a sklearn.pipeline.Pipeline to combine standardization and classification in a single object:

rf = Pipeline([("scale", StandardScaler()),
               ("rf", RandomForestClassifier(n_estimators=100, n_jobs=-1, verbose=2))])

Or, with make_pipeline (only in the development version when this was written, included in later releases):

rf = make_pipeline(StandardScaler(),
                   RandomForestClassifier(n_estimators=100, n_jobs=-1, verbose=2))
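
For completeness, a runnable sketch of the pipeline approach with its imports. The data is synthetic and the smaller n_estimators and the random_state values are only there to keep the example quick and repeatable; none of it comes from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the first column is far outside the float32 range.
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
X[:, 0] *= 1e41
y = (X[:, 1] > 0).astype(int)

# The scaler runs in float64 and brings every column to unit scale,
# so the forest's internal cast to float32 no longer overflows.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(X, y)

# predict() applies the same scaling before the classifier sees the data.
pred = model.predict(X[:5])
print(pred)
```

Because the scaler is part of the pipeline, there is no separate transform step to forget at prediction time.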

(I admit the error message could be improved.)




Answer 2:


I came across this problem as well. In my case, however, the array really did contain some NaN values.

Here is how to fix it.

from sklearn.preprocessing import Imputer
X = Imputer().fit_transform(X)
RF.fit(X, y)

Reference here: sklearn.preprocessing.Imputer
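
Note that sklearn.preprocessing.Imputer was deprecated in scikit-learn 0.20 and removed in 0.22. On newer versions the equivalent is sklearn.impute.SimpleImputer. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one NaN in each column.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# strategy="mean" replaces each NaN with the mean of its column.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)

print(X_filled)
```

The filled array can then be passed to RF.fit(X_filled, y) as before.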



Source: https://stackoverflow.com/questions/21320456/scikit-nan-or-infinity-error-message
