How to transform a categorical variable in Spark into a set of columns coded as {0,1}?

终归单人心 · 2020-12-29 14:42

I'm trying to perform a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (with Scala) on a dataset which contains categorical variables. I discover Spark …
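The question is essentially asking for one-hot encoding. The usual route to {0,1} columns in Spark is StringIndexer followed by OneHotEncoder; a minimal sketch, assuming Spark 3.x (where OneHotEncoder is an estimator; on Spark 2.3/2.4 the same class is named OneHotEncoderEstimator) and a hypothetical DataFrame df with a string column "country":

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // Hypothetical input: DataFrame `df` with a string column "country".
    val indexed = new StringIndexer()
      .setInputCol("country")
      .setOutputCol("countryIndex")
      .fit(df)
      .transform(df)

    // Encode the numeric index as a 0/1 vector column
    // (by default the last category is dropped: dropLast = true).
    val encoded = new OneHotEncoder()
      .setInputCols(Array("countryIndex"))
      .setOutputCols(Array("countryVec"))
      .fit(indexed)
      .transform(indexed)

Note that LogisticRegressionWithLBFGS lives in the RDD-based mllib API, so the resulting vector column would still need to be assembled into LabeledPoint rows, or the DataFrame-based ml LogisticRegression used instead.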

4 Answers
  •  孤独总比滥情好
    2020-12-29 15:07

    Using VectorIndexer, you can tell the indexer, via the setMaxCategories() method, the maximum number of distinct values (cardinality) a feature may have in order to be treated as categorical.

    import org.apache.spark.ml.feature.VectorIndexer

    val indexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexed")
      .setMaxCategories(10)

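    To actually index a dataset, fit the indexer and then transform; a short sketch, assuming a DataFrame df that has the "features" vector column from above:

    val model = indexer.fit(df)         // learns which features are categorical
    val indexedDf = model.transform(df) // adds the "indexed" vector column
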

    From the Scaladoc:

    Class for indexing categorical feature columns in a dataset of Vector.

    This has 2 usage modes:

    Automatically identify categorical features (default behavior)

    This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter.

    Set maxCategories to the maximum number of categories any categorical feature should have.

    E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.

    I find this a convenient (though coarse-grained) way to extract categorical features, but beware of cases where a low-arity field should remain continuous (e.g. age in a college-student dataset may have fewer distinct values than origin country or US state, yet should stay continuous).
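    To make the quoted Scaladoc example concrete, here is a runnable sketch (the SparkSession spark, df, and the literal values are mine, mirroring that example): feature 0 has two distinct values and feature 1 has three, so with maxCategories = 2 only feature 0 is indexed as categorical.

    import org.apache.spark.ml.feature.VectorIndexer
    import org.apache.spark.ml.linalg.Vectors

    val df = spark.createDataFrame(Seq(
      (Vectors.dense(-1.0, 1.0), 1.0),
      (Vectors.dense(0.0, 3.0), 0.0),
      (Vectors.dense(-1.0, 5.0), 1.0)
    )).toDF("features", "label")

    val model = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexed")
      .setMaxCategories(2)
      .fit(df)

    // Feature 0 has 2 distinct values (<= maxCategories) and gets indexed;
    // feature 1 has 3 and is left continuous, as in the Scaladoc example.
    println(model.categoryMaps)   // contains a value-to-index map for feature 0 only
    model.transform(df).show(false)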
