Tagging columns as Categorical in Spark

Posted by 五迷三道 on 2020-01-02 10:18:34

Question


I am currently using StringIndexer to convert a lot of columns into unique integers for classification with RandomForestModel. I am also using a pipeline for the ML process.

My questions are:

  1. How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add some sort of metadata to indicate that a column is categorical? In mllib.tree.RF there was a parameter called categoricalFeaturesInfo that indicated which columns are categorical. How does ml.tree.RF know which ones are, since that parameter is not present?

  2. Also, StringIndexer maps categories to integers based on frequency of occurrence. When new data comes in, how do I make sure that it is encoded consistently with the training data? Is it possible to do that without StringIndexing the whole data again, including the new data?

I am quite confused about how to implement this.


Answer 1:


Is it possible to do that without StringIndexing the whole data again including the new data?

Yes, it is possible. You just have to use an indexer fitted on the training data. If you use ML pipelines, this is handled for you; otherwise, just use the StringIndexerModel directly:

// Assumes spark-shell, where sc and the implicits needed for toDF are already in scope.
import org.apache.spark.ml.feature.StringIndexer

val train = sc.parallelize(Seq((1, "a"), (2, "a"), (3, "b"))).toDF("x", "y")
val test  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "b"))).toDF("x", "y")

val indexer = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("y_index")
  .fit(train)

indexer.transform(train).show

// +---+---+-------+
// |  x|  y|y_index|
// +---+---+-------+
// |  1|  a|    0.0|
// |  2|  a|    0.0|
// |  3|  b|    1.0|
// +---+---+-------+

indexer.transform(test).show

// +---+---+-------+
// |  x|  y|y_index|
// +---+---+-------+
// |  1|  a|    0.0|
// |  2|  b|    1.0|
// |  3|  b|    1.0|
// +---+---+-------+
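For the pipeline case mentioned above, a minimal sketch might look like the following. It assumes Spark 2.x or later (a SparkSession in local mode), and the toy data, the "label" column, and the tree settings are made up for illustration. The point is only that fitting the Pipeline fits the StringIndexer once on the training data, and the resulting PipelineModel reuses that mapping on new data.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: local SparkSession (Spark 2.x+); in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("pipeline-sketch").getOrCreate()
import spark.implicits._

// Hypothetical toy data: x is numeric, y is categorical, label is the target.
// "a" is more frequent than "b", so the frequency-based indexing is unambiguous.
val train = Seq(
  (1.0, "a", 0.0), (2.0, "a", 0.0), (3.0, "a", 0.0),
  (4.0, "b", 1.0), (5.0, "b", 1.0)
).toDF("x", "y", "label")
val incoming = Seq((1.5, "b"), (4.5, "a")).toDF("x", "y")

val indexer = new StringIndexer().setInputCol("y").setOutputCol("y_index")
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y_index"))
  .setOutputCol("features")
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(5)
  .setSeed(42L)

// Fitting the pipeline fits the indexer on `train` only.
val model = new Pipeline().setStages(Array(indexer, assembler, rf)).fit(train)

// The fitted pipeline applies the training-time mapping (a -> 0.0, b -> 1.0) to new rows.
val predictions = model.transform(incoming)
predictions.select("x", "y", "y_index", "prediction").show()
```

Nothing is re-indexed here: the mapping is stored in the fitted StringIndexerModel stage inside the PipelineModel.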

One possible caveat is that the fitted indexer does not handle unseen labels gracefully, so you have to drop these before transforming.
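As an alternative to dropping unseen labels by hand, later Spark releases expose a handleInvalid parameter on StringIndexer. The sketch below assumes a version where both "skip" (drop rows with unseen labels) and "keep" (put unseen labels in an extra bucket at index numLabels) are available, which to my knowledge means Spark 2.2 or later; the data and variable names are invented for illustration.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// Assumption: local SparkSession (Spark 2.2+ for "keep"); in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("handle-invalid-sketch").getOrCreate()
import spark.implicits._

val train = Seq((1, "a"), (2, "a"), (3, "b")).toDF("x", "y")
// Hypothetical new data: label "c" never appeared during training.
val incoming = Seq((1, "a"), (2, "c")).toDF("x", "y")

// "skip": rows whose label was not seen during fit are filtered out.
val skippingIndexer = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("y_index")
  .setHandleInvalid("skip")
  .fit(train)
val skipped = skippingIndexer.transform(incoming)

// "keep": unseen labels go to one extra bucket at index numLabels (2.0 here, since a and b were seen).
val keepingIndexer = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("y_index")
  .setHandleInvalid("keep")
  .fit(train)
val kept = keepingIndexer.transform(incoming)

skipped.show()
kept.show()
```

With the default setting ("error"), transforming a row that contains an unseen label fails instead, which is the behaviour the caveat above describes.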

How does the RandomForestModel know which columns are categorical?

Different ML transformers add special metadata to the transformed columns, which indicates the type of the column, the number of classes, etc.

import org.apache.spark.ml.attribute._
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y_index"))
  .setOutputCol("features")

val transformed = assembler.transform(indexer.transform(train))
val meta = AttributeGroup.fromStructField(transformed.schema("features"))
meta.attributes.get

// Array[org.apache.spark.ml.attribute.Attribute] = Array(
//   {"type":"numeric","idx":0,"name":"x"},
//   {"vals":["a","b"],"type":"nominal","idx":1,"name":"y_index"})

or

transformed.select($"features").schema.fields.last.metadata
// {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"x"}],
//  "nominal":[{"vals":["a","b"],"idx":1,"name":"y_index"}]},"num_attrs":2}}
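Since the categorical information lives in column metadata, a column that already holds category indices can in principle be tagged by hand, without going through StringIndexer. The following is only a sketch under that assumption: the "code" column and its labels are hypothetical, and it relies on NominalAttribute from org.apache.spark.ml.attribute together with Column.as(alias, metadata).

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: local SparkSession (Spark 2.x+); in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("nominal-metadata-sketch").getOrCreate()
import spark.implicits._

// Hypothetical column "code" that already contains category indices 0.0, 1.0, 2.0.
val df = Seq((1, 0.0), (2, 2.0), (3, 1.0)).toDF("x", "code")

// Describe the column as nominal with three (made-up) category names.
val nominalMeta = NominalAttribute.defaultAttr
  .withName("code")
  .withValues("low", "mid", "high")
  .toMetadata()

// Re-alias the column with the metadata attached; the values themselves are unchanged.
val tagged = df.withColumn("code", col("code").as("code", nominalMeta))

// Read the attribute back from the schema to confirm the tag.
val attr = Attribute.fromStructField(tagged.schema("code"))
println(attr.isNominal)
```

A VectorAssembler applied to the tagged column should then carry the nominal attribute into the features vector, in the same way as it does for the StringIndexer output shown earlier.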


Source: https://stackoverflow.com/questions/34070050/tagging-columns-as-categorical-in-spark
