Using scikit-learn (sklearn), how to handle missing data for linear regression?

Submitted by 给你一囗甜甜゛ on 2020-06-27 07:23:14

Question


I tried the approach from "Use Scikit Learn to do linear regression on a time series pandas data frame" but couldn't get it to work for my data.

My data consists of 2 DataFrames: DataFrame_1.shape = (40, 5000) and DataFrame_2.shape = (40, 74). I'm trying to do some type of linear regression, but DataFrame_2 contains NaN (missing) values. When I call DataFrame_2.dropna(how="any"), the shape drops to (2, 74).

Is there any linear regression algorithm in sklearn that can handle NaN values?

I'm modeling it after load_boston from sklearn.datasets, where X, y = boston.data, boston.target have shapes (506, 13) and (506,).

Here's my simplified code:

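from sklearn.linear_model import LinearRegression

X = DataFrame_1  # predictors, shape (40, 5000)
for col in DataFrame_2.columns:
    y = DataFrame_2[col]  # one target column, may contain NaN
    model = LinearRegression()
    model.fit(X, y)

# ValueError: Input contains NaN, infinity or a value too large for dtype('float64').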

I used the loop above so that the shapes of X and y match up.

If posting the DataFrame_2 would help, please comment below and I'll add it.


Answer 1:


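You can fill in the missing values in y by imputation. In scikit-learn 0.20 and later this is done with SimpleImputer from sklearn.impute (the older sklearn.preprocessing.Imputer has been removed):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")  # replace NaN with the column mean
# fit_transform expects 2-D input, so pass y as a one-column DataFrame
y_imputed = imputer.fit_transform(y.to_frame()).ravel()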
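
As a rough end-to-end sketch of the imputation idea, the loop from the question could look like the following. It assumes a recent scikit-learn in which the imputer is sklearn.impute.SimpleImputer. The frames here are small synthetic stand-ins (random values, shrunk shapes, made-up column names), not the asker's real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for the question's frames (assumption: shapes shrunk for speed).
DataFrame_1 = pd.DataFrame(rng.normal(size=(40, 10)),
                           columns=[f"x{i}" for i in range(10)])
DataFrame_2 = pd.DataFrame(rng.normal(size=(40, 5)),
                           columns=[f"y{i}" for i in range(5)])
# Blank out roughly 20% of the target values to mimic missing data.
DataFrame_2 = DataFrame_2.mask(rng.random(DataFrame_2.shape) < 0.2)

X = DataFrame_1
imputer = SimpleImputer(strategy="mean")
models = {}
for col in DataFrame_2.columns:
    # SimpleImputer expects 2-D input, so select the column as a one-column frame.
    y = imputer.fit_transform(DataFrame_2[[col]]).ravel()
    models[col] = LinearRegression().fit(X, y)
```

Each entry of models is then a fitted regression for one target column. Mean-filled targets pull the fit toward the column mean, so this is probably best treated as a baseline rather than a final model.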

Otherwise, you might want to build your model using only a subset of the 74 columns as predictors. Perhaps some of your columns contain fewer null values?
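
One way to find such a subset is to measure the share of NaN values per column and keep only the columns below a threshold. The sketch below is illustrative: the helper name columns_with_few_nans, the threshold value, and the toy frame are all assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd

def columns_with_few_nans(df, max_missing_fraction=0.1):
    """Return only the columns whose share of NaN values is at most the threshold."""
    missing_fraction = df.isna().mean()  # per-column fraction of NaN values
    return df.loc[:, missing_fraction <= max_missing_fraction]

# Toy frame: "a" has no NaN, "b" is half NaN, "c" has one NaN in four rows.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})
kept = columns_with_few_nans(df, max_missing_fraction=0.25)
print(list(kept.columns))  # ['a', 'c']

# Dropping incomplete rows now discards far less data than on the full frame.
complete_rows = kept.dropna(how="any")
print(complete_rows.shape)  # (3, 2)
```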




Answer 2:


If your variable is a DataFrame, you can use fillna. Here I replace the missing values in each column with the mean of that column.

df.fillna(df.mean(), inplace=True)
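
Once the targets are filled, the per-column loop is optional, because LinearRegression accepts a 2-D y and fits all targets in one call. A minimal sketch on synthetic data follows; the shapes and column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic predictors and targets (assumption: small shapes for illustration).
X = pd.DataFrame(rng.normal(size=(40, 10)),
                 columns=[f"x{i}" for i in range(10)])
Y = pd.DataFrame(rng.normal(size=(40, 5)),
                 columns=[f"y{i}" for i in range(5)])
Y = Y.mask(rng.random(Y.shape) < 0.2)  # inject missing values

# Fill each column's NaN with that column's mean (mean() skips NaN by default).
Y_filled = Y.fillna(Y.mean())

# A 2-D target fits one regression per column in a single call.
model = LinearRegression().fit(X, Y_filled)
print(model.coef_.shape)  # (5, 10)
```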


Source: https://stackoverflow.com/questions/33113947/using-scikit-learn-sklearn-how-to-handle-missing-data-for-linear-regression
