Why does StandardScaler not attach metadata to the output column?


Question


I noticed that the ml StandardScaler does not attach metadata to the output column:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._

val df = spark.read.option("header", true)
  .option("inferSchema", true)
  .csv("/path/to/cars.data")

val strId1 = new StringIndexer()
  .setInputCol("v7")
  .setOutputCol("v7_IDX")
val strId2 = new StringIndexer()
  .setInputCol("v8")
  .setOutputCol("v8_IDX")

val assmbleFeatures: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7_IDX"))
  .setOutputCol("featuresRaw")

val scalerModel = new StandardScaler()
  .setInputCol("featuresRaw")
  .setOutputCol("scaledFeatures")


val plm = new Pipeline()
  .setStages(Array(strId1, strId2, assmbleFeatures, scalerModel))
  .fit(df)

val dft = plm.transform(df)

dft.schema("scaledFeatures").metadata

Gives:

res1: org.apache.spark.sql.types.Metadata = {}

This example works with the cars dataset (just adapt the path in the code above).
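
For comparison, the assembler output in dft does carry metadata; it is only the scaler output that comes back empty (a quick check; the exact attribute contents depend on the dataset):

dft.schema("featuresRaw").metadata
// non-empty: numeric attributes for v0..v6 plus a nominal attribute for v7_IDX

dft.schema("scaledFeatures").metadata
// {} -- the StandardScaler output has no attributes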

Is there a specific reason for this? Is it likely that this feature will be added to Spark in the future? Any suggestions for a workaround that does not include duplicating the StandardScaler?


Answer 1:


While discarding metadata is probably not the most fortunate choice, scaling indexed categorical features doesn't make any sense. Values returned by the StringIndexer are just labels.
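
To see why, a toy example may help (illustrative only; the toy DataFrame, its values, and the brand/brand_IDX column names are not taken from the question's data). StringIndexer assigns indices by label frequency, so the resulting numbers are identifiers rather than quantities, and centering or scaling them has no statistical meaning:

import org.apache.spark.ml.feature.StringIndexer

// Toy data: "ford" appears twice, "vw" once.
val toy = spark.createDataFrame(Seq(
  (0, "ford"), (1, "ford"), (2, "vw")
)).toDF("id", "brand")

new StringIndexer()
  .setInputCol("brand")
  .setOutputCol("brand_IDX")
  .fit(toy)
  .transform(toy)
  .show()
// With the default frequency ordering, "ford" -> 0.0 and "vw" -> 1.0;
// these values only identify categories, so scaling them changes nothing useful.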

If you want to scale numerical features, it should be a separate stage:

val numericAssembler: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6"))
  .setOutputCol("numericFeatures")

val scaler = new StandardScaler()
  .setInputCol("numericFeatures")
  .setOutputCol("scaledNumericFeatures")

val finalAssembler: VectorAssembler = new VectorAssembler() 
  .setInputCols(Array("scaledNumericFeatures", "v7_IDX"))
  .setOutputCol("features")

new Pipeline()
  .setStages(Array(strId1, strId2, numericAssembler, scaler, finalAssembler))
  .fit(df)
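
A minimal sketch of how the restructured pipeline could be used (plm2 and dft2 are illustrative names, not part of the original code):

val plm2 = new Pipeline()
  .setStages(Array(strId1, strId2, numericAssembler, scaler, finalAssembler))
  .fit(df)

val dft2 = plm2.transform(df)

// The final VectorAssembler writes fresh attribute metadata for "features":
// the nominal attribute of v7_IDX carries over, while the slots coming from
// scaledNumericFeatures typically get generic numeric names, since
// StandardScaler itself does not propagate metadata.
dft2.schema("features").metadata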

Keeping in mind the concerns raised at the beginning of this answer, you can also try copying the metadata:

// The $-syntax needs the SparkSession implicits (already in scope in spark-shell)
import spark.implicits._

// Re-attach the metadata of featuresRaw to the scaled column
val result = plm.transform(df).transform(df =>
  df.withColumn(
    "scaledFeatures",
    $"scaledFeatures".as(
      "scaledFeatures",
      df.schema("featuresRaw").metadata)))

result.schema("scaledFeatures").metadata
{"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"v0"},{"idx":1,"name":"v1"},{"idx":2,"name":"v2"},{"idx":3,"name":"v3"},{"idx":4,"name":"v4"},{"idx":5,"name":"v5"},{"idx":6,"name":"v6"}],"nominal":[{"vals":["ford","chevrolet","plymouth","dodge","amc","toyota","datsun","vw","buick","pontiac","honda","mazda","mercury","oldsmobile","peugeot","fiat","audi","chrysler","volvo","opel","subaru","saab","mercedes","renault","cadillac","bmw","triumph","hi","capri","nissan"],"idx":7,"name":"v7_IDX"}]},"num_attrs":8}}


Source: https://stackoverflow.com/questions/44651418/why-does-standardscaler-not-attach-metadata-to-the-output-column
