Question
I am trying to use grid search to choose the number of principal components of the data before fitting a linear regression. I am unsure how to specify the candidate numbers of components. I put my list into a dictionary in the param_grid parameter, but I think I did it wrong. So far, I get an error saying that my array contains infs or NaNs.
I am following the example on pipelining PCA into a linear regression: http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html
ValueError: array must not contain infs or NaNs
I was able to get the same error on a reproducible example (my real dataset is larger):
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
df2 = pd.DataFrame({'C': pd.Series(1, index=list(range(8)), dtype='float32'),
                    'D': np.array([3] * 8, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train",
                                         "test", "train", "test", "train"])})
df3 = pd.get_dummies(df2)
lm = LinearRegression()
pipe = [('pca', PCA(whiten=True)),
        ('clf', lm)]
pipe = Pipeline(pipe)
param_grid = {
    'pca__n_components': np.arange(2, 4)}
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
X = df3.to_numpy(dtype=float)
CLF = GridSearchCV(pipe, param_grid=param_grid, verbose=1, cv=3)
y = np.random.normal(0, 1, len(X)).reshape(-1, 1)
CLF.fit(X, y)
ValueError: array must not contain infs or NaNs
EDIT: I added y to the fit call, but it still gave me the same error. However, that was on my real dataset, NOT the reproducible example.
Answer 1:
It could be a problem with the PCA implementation in scikit-learn 0.18.1.
See this bug report: https://github.com/scikit-learn/scikit-learn/issues/7568
The workaround described there is to use PCA with svd_solver='full'.
So try this code:
pipe = [('pca', PCA(whiten=True, svd_solver='full')),
        ('clf', lm)]
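Before switching solvers, it may also be worth checking whether the input itself contributes to the error. The following is a minimal sketch and is not taken from the linked bug report: describe_matrix is a hypothetical helper name, and X_example is a hand-built stand-in for the dummy-encoded data in the question. It reports non-finite entries, constant columns, and the rank of the centered data. If that rank is below the requested n_components, whitening has to rescale components with (near-)zero variance, which is one plausible way to end up with infs or NaNs.

```python
import numpy as np


def describe_matrix(X):
    """Summarize properties of X that can trip up PCA (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    finite = bool(np.isfinite(X).all())
    centered = X - X.mean(axis=0)
    return {
        "has_nan": bool(np.isnan(X).any()),
        "has_inf": bool(np.isinf(X).any()),
        # indices of columns whose value never changes
        "constant_columns": np.flatnonzero(X.max(axis=0) == X.min(axis=0)).tolist(),
        # PCA works on centered data, so this bounds the useful n_components;
        # the rank is skipped (None) when X has non-finite entries
        "centered_rank": int(np.linalg.matrix_rank(centered)) if finite else None,
    }


# Stand-in for the question's data: C = 1, D = 3, plus the two dummy columns for E
e_test = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
X_example = np.column_stack([np.ones(8), np.full(8, 3.0), e_test, 1.0 - e_test])

report = describe_matrix(X_example)
print(report)
```

For this stand-in matrix the two constant columns and the two mirrored dummy columns leave only one direction of variation, so a grid that asks for 2 or 3 components requests more than the data can support.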
Answer 2:
Here is some code I wrote. It seems to work for me. Notice that when you call fit, you need to provide the target values as well (i.e. a y vector).
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
df2 = pd.DataFrame({'C': pd.Series(1, index=list(range(8)), dtype='float32'),
                    'D': np.array([3] * 8, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train",
                                         "test", "train", "test", "train"])})
df3 = pd.get_dummies(df2)
lm = LinearRegression()
pipe = [('pca', PCA(whiten=True)),
        ('clf', lm)]
pipe = Pipeline(pipe)
param_grid = {
    'pca__n_components': np.arange(2, 4),
}
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
X = df3.to_numpy(dtype=float)
CLF = GridSearchCV(pipe, param_grid=param_grid, verbose=1, cv=3)
y = np.random.normal(0, 1, len(X)).reshape(-1, 1)
CLF.fit(X, y)
print(CLF.best_params_)
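The print statement will show you the best n_components. Without a y, you can't compute a residual-based score such as the RSS (with the default scoring, GridSearchCV uses R^2 for LinearRegression), and you won't be able to tell what is "best".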
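To see how each candidate scored rather than only the winner, cv_results_ can be inspected. This is a minimal sketch under stated assumptions: the data is synthetic and full rank (generated here, not the question's dataset), and svd_solver='full' is borrowed from the other answer. The variable names search, X and y are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for illustration): 60 samples, 5 independent features
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=60)

pipe = Pipeline([
    ("pca", PCA(whiten=True, svd_solver="full")),
    ("clf", LinearRegression()),
])

# Grid keys follow the <step name>__<parameter> pattern; get_params() lists the valid ones
assert "pca__n_components" in pipe.get_params()
param_grid = {"pca__n_components": [2, 3, 4]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)

# One mean cross-validated R^2 per candidate, in the same order as "params"
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(float(score), 3))
print("best:", search.best_params_)
```

Listing the per-candidate scores makes it easier to judge whether the "best" value wins by a clear margin or only by noise between folds.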
Source: https://stackoverflow.com/questions/41230558/pca-in-sklearn-valueerror-array-must-not-contain-infs-or-nans