Spark: OneHot encoder and storing Pipeline (feature dimension issue)

Posted by 最后都变了 on 2019-11-29 15:32:47

Spark >= 2.3

Spark 2.3 introduces OneHotEncoderEstimator (renamed OneHotEncoder in Spark 3.0), which can be used directly and supports multiple input columns.
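A minimal sketch of the new API, assuming a DataFrame `df` with hypothetical categorical index columns `x1` and `x2`:

```scala
import org.apache.spark.ml.feature.OneHotEncoderEstimator

// Hypothetical input columns x1 and x2 holding categorical indices
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("x1", "x2"))
  .setOutputCols(Array("x1_vec", "x2_vec"))

// Unlike the old OneHotEncoder, this is an Estimator: fit() learns the
// number of levels from the data, so a saved and reloaded model keeps
// the feature dimension.
val model = encoder.fit(df)
val encoded = model.transform(df)
```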

Spark < 2.3

You are not using OneHotEncoder as it is intended to be used. OneHotEncoder is a Transformer, not an Estimator. It doesn't store any information about the levels; instead it depends on the column metadata to determine the output dimension. If metadata is missing, as in your case, it uses a fallback strategy and assumes there are max(input_column) levels. Serialization is irrelevant here.

Typical usage involves Transformers upstream in the Pipeline, which set the metadata for you. One common example is StringIndexer.
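For example, a sketch of such a Pipeline, assuming a DataFrame `df` with a string column `class` (the index and vector column names are illustrative):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// StringIndexer attaches nominal metadata to classIdx, so the downstream
// OneHotEncoder can read the number of levels from the column metadata
// even after the fitted Pipeline is saved and reloaded.
val indexer = new StringIndexer()
  .setInputCol("class")
  .setOutputCol("classIdx")

val encoder = new OneHotEncoder()
  .setInputCol("classIdx")
  .setOutputCol("classVec")

val pipeline = new Pipeline().setStages(Array(indexer, encoder))
val model = pipeline.fit(df)
```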

It is still possible to set metadata manually, but it is more involved:

import org.apache.spark.ml.attribute.NominalAttribute

val meta = NominalAttribute.defaultAttr
  .withName("class")
  .withValues("0", (1 to 5).map(_.toString): _*)
  .toMetadata

loadedModel.transform(df2.select($"class".as("class", meta), $"output"))

Similarly in Python (needs Spark >= 2.2):

from pyspark.sql.functions import col

meta = {"ml_attr": {
    "vals": [str(x) for x in range(6)],   # Provide a set of levels
    "type": "nominal", 
    "name": "class"}}

loaded.transform(
    df.withColumn("class", col("class").alias("class", metadata=meta))
)

Metadata can also be attached using a number of different methods; see How to change column metadata in pyspark?.
