Can't do linear regression in scikit-learn due to “reshaping” issue


故里飘歌

I have a simple CSV with two columns:

ErrorWeek (a number for the week number in the year)
ErrorCount (for the number of errors in a given week)

4 Answers

日久生厌

2021-01-06 16:03
Doing:
```
X_train, X_test, y_train, y_test = train_test_split(
         df['ErrorWeek'], df['ErrorCount'], random_state=0)
```
makes all output arrays one-dimensional, because you are selecting a single column (a pandas Series) for both X and y.

Now, when you pass a one-dimensional array of shape (n,), scikit-learn cannot tell whether it is one row of data with multiple columns or multiple samples with a single column. That is, from X alone sklearn cannot infer whether n_samples=n and n_features=1, or the other way around (n_samples=1 and n_features=n).

Hence it asks you to reshape the 1-D data you provided into 2-D data of shape [n_samples, n_features].
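To make the ambiguity concrete, here is a minimal sketch using plain NumPy. The `weeks` array is hypothetical data standing in for the ErrorWeek column; the two reshape calls are the ones the scikit-learn error message itself suggests for the single-feature and single-sample cases.

```python
import numpy as np

# Hypothetical week numbers, standing in for the ErrorWeek column.
weeks = np.array([1, 2, 3, 4, 5])
print(weeks.shape)             # 1-D: ambiguous for an estimator

# Interpretation 1: five samples with one feature each.
as_samples = weeks.reshape(-1, 1)
print(as_samples.shape)

# Interpretation 2: one sample with five features.
as_features = weeks.reshape(1, -1)
print(as_features.shape)
```

For a single predictor column such as ErrorWeek, the first interpretation is the one you want.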

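Now there are multiple ways of doing this.
- You can do what the scikit-learn error message says. Because X_train and X_test are pandas Series here, which have no reshape method in current pandas, convert them to NumPy arrays first:

      X_train = X_train.values.reshape(-1, 1)
      X_test = X_test.values.reshape(-1, 1)

  The 1 in the second position of reshape says that there is a single column only, and the -1 tells NumPy to work out the number of rows for this single column automatically.
- Do as suggested in other answers by @MaxU and @Wen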

Happy的楠姐

2021-01-06 16:04
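Change the call to fit so that X gets a second axis:
```
# X_train is a pandas Series here, so take its NumPy values before adding an axis
regr.fit(X_train.values[:, None], y_train)
```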
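As a quick check of what the `[:, None]` indexing does, the sketch below compares it with `reshape(-1, 1)` on a plain NumPy array. The array `x` is made-up data, not taken from the question.

```python
import numpy as np

x = np.array([10, 20, 30])    # hypothetical 1-D feature values

col_a = x[:, None]            # add a new axis of length 1 at the end
col_b = x[:, np.newaxis]      # np.newaxis is an alias for None
col_c = x.reshape(-1, 1)      # -1 lets NumPy infer the number of rows

print(col_a.shape)
print(np.array_equal(col_a, col_b), np.array_equal(col_a, col_c))
```

All three spellings produce the same single-column 2-D array, so which one to use is mostly a matter of taste.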

长情又很酷

2021-01-06 16:06
Apparently sklearn wants x to be two-dimensional, for example a pandas.core.frame.DataFrame, because with 1-D input it cannot distinguish between a single feature with n samples and n features with one sample. y, on the other hand, can be a single column, that is a pandas.core.series.Series. Therefore, in your example, you should make x a pandas.core.frame.DataFrame.

As already pointed out by @MaxU:
```
x=df[['ErrorWeek']]   # double brackets -> DataFrame (2-D)
y=df['ErrorCount']    # single brackets -> Series (1-D)
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
```
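The difference between single and double brackets can be inspected directly. This is a small sketch with a made-up DataFrame that only reuses the column names from the question; the values are hypothetical.

```python
import pandas as pd

# Hypothetical data with the two columns from the question.
df = pd.DataFrame({"ErrorWeek": [1, 2, 3, 4],
                   "ErrorCount": [7, 3, 5, 2]})

single = df["ErrorWeek"]      # pandas Series, 1-D
double = df[["ErrorWeek"]]    # pandas DataFrame, 2-D with one column

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```

The Series has a one-element shape, while the one-column DataFrame has a shape of the form (n_samples, 1), which is what the estimator expects for X.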

甜味超标

2021-01-06 16:25
Try this:
```
X_train, X_test, y_train, y_test = train_test_split(
    df[['ErrorWeek']], df['ErrorCount'], random_state=0)
```
PS: note the additional square brackets in df[['ErrorWeek']]. They select a one-column DataFrame (2-D) instead of a Series (1-D).
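For completeness, here is a rough end-to-end sketch of the double-bracket approach. The data is synthetic (ErrorCount is generated as an exact linear function of ErrorWeek purely for illustration), and the use of `LinearRegression` under the name `regr` is an assumption, since the question's own model code is not shown here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: ErrorCount = 3 * ErrorWeek + 2, so a linear fit is exact.
df = pd.DataFrame({"ErrorWeek": range(1, 21)})
df["ErrorCount"] = 3 * df["ErrorWeek"] + 2

# Double brackets keep X two-dimensional; y stays a 1-D Series.
X_train, X_test, y_train, y_test = train_test_split(
    df[["ErrorWeek"]], df["ErrorCount"], random_state=0)

regr = LinearRegression()
regr.fit(X_train, y_train)            # X_train is 2-D, so no reshape is needed

print(X_train.shape, X_test.shape)
print(regr.coef_, regr.intercept_)    # should be close to 3 and 2
print(regr.score(X_test, y_test))
```

With real data the coefficients and score will of course differ; the point is only that fit and score accept X without any reshaping.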