Linear regression analysis with string/categorical features (variables)?

面向向阳花 2020-11-30 18:43

Regression algorithms seem to be working on features represented as numbers. For example:

This data set doesn't contain categorical features/variables.

4 Answers
  •  小蘑菇 (OP)
     2020-11-30 19:23

    In linear regression with categorical variables you should be careful of the dummy variable trap. The dummy variable trap is a scenario in which the independent variables are multicollinear, i.e. two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. This can make the design matrix singular, meaning your model just won't work. Read about it here

    The idea is to use dummy variable encoding with drop_first=True, which omits one column per category after converting the categorical variables into dummy/indicator variables. You will NOT lose any relevant information by doing that, simply because every point in your dataset can be fully explained by the rest of the features.
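To see the trap concretely, here is a minimal sketch with a single hypothetical categorical column (the category values are made up for illustration). With all dummy columns present, every row sums to 1, so the columns are perfectly collinear with an intercept column of ones; drop_first=True removes one column and breaks that dependency:

```python
import pandas as pd

# Hypothetical column with three categories, just for illustration
s = pd.DataFrame({"Condition": ["good", "bad", "fair", "good", "fair"]})

full = pd.get_dummies(s)                      # one column per category
reduced = pd.get_dummies(s, drop_first=True)  # first category dropped

# Full dummies: each row sums to 1, perfectly collinear with an intercept
print(full.sum(axis=1).tolist())   # [1, 1, 1, 1, 1]
print(full.shape, reduced.shape)   # (5, 3) (5, 2)
```

The dropped category becomes the baseline: its effect is absorbed into the intercept, and each remaining dummy coefficient is interpreted relative to it.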

    Here is the complete code for how to do it with your housing dataset.

    So you have categorical features:

    District, Condition, Material, Security, Type
    

    And one numerical feature that you are trying to predict:

    Price
    

    First you need to split your initial dataset into input variables and the prediction target; assuming it's a pandas DataFrame, it would look like this:

    Input variables:

    X = housing[['District','Condition','Material','Security','Type']]
    

    Prediction:

    Y = housing['Price']
    

    Convert categorical variable into dummy/indicator variables and drop one in each category:

    X = pd.get_dummies(data=X, drop_first=True)
    

    So now, if you check the shape of X with drop_first=True, you will see that it has five fewer columns: one for each of your categorical variables.
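You can verify that column count yourself on a toy version of the data. The values below are invented stand-ins for the real housing dataset, just to show the shapes:

```python
import pandas as pd

# Hypothetical miniature version of the housing data (values are made up)
housing = pd.DataFrame({
    "District": ["North", "South", "North", "East"],
    "Condition": ["good", "bad", "good", "fair"],
    "Material": ["brick", "wood", "brick", "wood"],
    "Security": ["yes", "no", "yes", "no"],
    "Type": ["flat", "house", "flat", "house"],
})

full = pd.get_dummies(housing)
reduced = pd.get_dummies(housing, drop_first=True)

# drop_first removes exactly one dummy column per categorical variable
print(full.shape[1] - reduced.shape[1])  # 5
```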

    You can now continue to use them in your linear model. For a scikit-learn implementation it could look like this:

    from sklearn import linear_model
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .20, random_state = 40)
    
    regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
    regr.fit(X_train, Y_train)
    predicted = regr.predict(X_test)
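
To sanity-check the whole pipeline end to end, here is a self-contained sketch on synthetic data (column names and category values are illustrative, not from the real dataset). It encodes the categoricals, fits the model, and scores it with R² on the held-out split:

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data (all names are illustrative)
n = 200
df = pd.DataFrame({
    "District": rng.choice(["North", "South", "East"], n),
    "Type": rng.choice(["flat", "house"], n),
})
# Price depends on the categories plus some noise
base = df["District"].map({"North": 100, "South": 120, "East": 90})
bump = df["Type"].map({"flat": 0, "house": 30})
Y = base + bump + rng.normal(0, 5, n)

# Encode, split, fit, and score
X = pd.get_dummies(df, drop_first=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=40)

regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
r2 = regr.score(X_test, Y_test)  # R^2 on held-out data
```

Because the synthetic prices are a clean function of the categories plus small noise, the test R² should come out high; on real data you would compare this score against a baseline before trusting the model.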
    
