Feature preprocessing of both continuous and categorical variables (of integer type) with scikit-learn


Question


The main goals are as follows:

1) Apply StandardScaler to continuous variables

2) Apply LabelEncoder and OneHotEncoder to categorical variables

The continuous variables need to be scaled, but a couple of the categorical variables are also of integer type. Applying StandardScaler across the board would therefore rescale the integer-coded categories as well, which is not what we want.

Since continuous variables and categorical ones are mixed in a single Pandas DataFrame, what's the recommended workflow to approach this kind of problem?

The best example to illustrate my point is the Kaggle Bike Sharing Demand dataset, where season and weather are integer categorical variables.
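
To make the problem concrete, here is a minimal sketch of such a mixed DataFrame (the column names follow the Bike Sharing dataset; the values are invented for illustration). Scaling the whole frame naively also standardizes the integer category codes:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "temp": [9.84, 14.76, 26.24],    # continuous
    "humidity": [81.0, 52.0, 43.0],  # continuous
    "season": [1, 2, 3],             # categorical, stored as int
    "weather": [1, 1, 2],            # categorical, stored as int
})

# Applying StandardScaler to every column rescales the season/weather
# codes too, destroying their categorical meaning:
scaled = StandardScaler().fit_transform(df)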


Answer 1:


Check out the sklearn_pandas.DataFrameMapper meta-transformer. Use it as the first step in your pipeline to perform column-wise data engineering operations:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn_pandas import DataFrameMapper

# Scale each continuous column; one-hot encode each categorical column.
mapper = DataFrameMapper(
    [([continuous_col], StandardScaler()) for continuous_col in continuous_cols] +
    [(categorical_col, LabelBinarizer()) for categorical_col in categorical_cols]
)
pipeline = Pipeline([
    ("mapper", mapper),
    ("estimator", estimator),
])
# Fit the whole pipeline on the raw DataFrame:
pipeline.fit(df, df["y"])
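
Note the bracket convention in DataFrameMapper: passing a column name inside a list (e.g. [continuous_col]) hands the transformer a 2-D array, which StandardScaler requires, while a bare string hands it a 1-D array, which is what LabelBinarizer expects.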

Also, you should be using sklearn.preprocessing.LabelBinarizer instead of a list of [LabelEncoder(), OneHotEncoder()].
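
As a small illustration of that equivalence (values invented), LabelBinarizer does in one step what the LabelEncoder + OneHotEncoder pair does in two, mapping raw labels straight to a one-hot matrix:

from sklearn.preprocessing import LabelBinarizer

seasons = [1, 2, 3, 2, 1]  # integer category codes
one_hot = LabelBinarizer().fit_transform(seasons)
# one_hot:
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]
#  [1 0 0]]

One caveat: with exactly two classes, LabelBinarizer returns a single 0/1 column rather than two columns.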



Source: https://stackoverflow.com/questions/43554821/feature-preprocessing-of-both-continuous-and-categorical-variables-of-integer-t
