Feature normalization algorithm in Spark

狂风中的少年 submitted on 2019-11-28 07:47:24

Your expectations are simply incorrect. As the official documentation clearly states, "Normalizer scales individual samples to have unit L^p norm", where the default value of p is 2. In other words, each row vector is rescaled on its own; feature columns are not rescaled across samples. Ignoring numerical precision issues:

import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vectors

val rdd = sc.parallelize(Seq(
    Vectors.dense(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),
    Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0),
    Vectors.dense(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0),
    Vectors.dense(-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),
    Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0)))

// Normalizer with the default p = 2 (L2 norm), applied to each row independently
val normalizer = new Normalizer()
val transformed = normalizer.transform(rdd)

// Sum of the components of each normalized row
transformed.map(_.toArray.sum).collect
// Array[Double] = Array(1.0009051182149054, 1.000085713673417,
//   0.9999142851020933, 1.00087797536153, 1.0)
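
The sums printed above are close to 1 only because the 70000.0 component dominates every row; the quantity that Normalizer actually fixes at 1 is the L2 norm of each row. The following is a minimal sketch in plain Scala (no Spark needed) of that per-row scaling. The object name L2NormSketch and its helpers are hypothetical and are not Spark's implementation; leaving an all-zero vector unchanged is an assumption made for this sketch.

```scala
// Hand-rolled L2 normalization of a single row; a sketch, not Spark's code.
object L2NormSketch {
  // Euclidean (L2) norm: square root of the sum of squared components
  def l2Norm(v: Array[Double]): Double =
    math.sqrt(v.map(x => x * x).sum)

  // Divide every component by the row's L2 norm.
  // Assumption for this sketch: an all-zero row is returned unchanged.
  def normalize(v: Array[Double]): Array[Double] = {
    val n = l2Norm(v)
    if (n == 0.0) v else v.map(_ / n)
  }

  def main(args: Array[String]): Unit = {
    // Second row of the sample data above
    val row = Array(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0)
    val scaled = normalize(row)
    println(l2Norm(scaled)) // 1.0 up to floating-point error
    println(scaled.sum)     // close to 1.0000857, like the sum shown above
  }
}
```

In spark-shell, the same check could be made on the transformed RDD with Vectors.norm(v, 2.0) from org.apache.spark.mllib.linalg.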

MLlib doesn't provide the functionality you need, but you can use StandardScaler from ML.

import org.apache.spark.ml.feature.StandardScaler

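// toDF relies on the SQL implicits, which spark-shell imports automatically.
// On Spark 2.0+, ML transformers expect org.apache.spark.ml.linalg vectors;
// there, convert first with rdd.map(v => Tuple1(v.asML)).
val df = rdd.map(Tuple1(_)).toDF("features")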

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(true)

val transformedDF = scaler.fit(df).transform(df)

// Show up to 5 rows without truncating the column contents
transformedDF.select($"scaledFeatures").show(5, false)

// +--------------------------------------------------------------------------------------------------------------------------+
// |scaledFeatures                                                                                                            |
// +--------------------------------------------------------------------------------------------------------------------------+
// |[0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0]                |
// |[1.0253040317020319,1.4038947727833362,1.414213562373095,-0.6532797101459693,-0.6532797101459693,-0.6010982697825494,0.0] |
// |[-1.0253040317020319,-1.4242574689236265,-1.414213562373095,-0.805205224133404,-0.805205224133404,-0.8536605680105113,0.0]|
// |[-0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0]               |
// |[0.0,-0.010181348070145075,0.0,-0.7292424671396867,-0.7292424671396867,-0.7273794188965303,0.0]                           |
// +--------------------------------------------------------------------------------------------------------------------------+
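
To see where these numbers come from, here is a minimal sketch in plain Scala (no Spark needed) of the per-column computation that withMean and withStd imply: subtract the column mean, then divide by the corrected sample standard deviation (n - 1 in the denominator). The object name StandardizeSketch is hypothetical and this is not Spark's implementation. It assumes a column with at least two values, and it maps a constant column to 0.0, which matches the last column of the output above.

```scala
// Hand-rolled standardization of a single feature column; a sketch only.
object StandardizeSketch {
  // Assumes column.size >= 2 so that the n - 1 denominator is positive.
  def standardize(column: Seq[Double]): Seq[Double] = {
    val n = column.size
    val mean = column.sum / n
    // Unbiased sample variance (divides by n - 1)
    val variance = column.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    val std = math.sqrt(variance)
    // A constant column has std 0; emit 0.0 instead of dividing by zero
    column.map(x => if (std == 0.0) 0.0 else (x - mean) / std)
  }

  def main(args: Array[String]): Unit = {
    // Third feature of the five sample rows
    println(standardize(Seq(0.0, 1.0, -1.0, 0.0, 0.0)))
    // Last feature: the constant 70000.0
    println(standardize(Seq(70000.0, 70000.0, 70000.0, 70000.0, 70000.0)))
  }
}
```

For the third feature, the mean is 0 and the sample variance is 2 / 4 = 0.5, so the value 1.0 maps to 1 / sqrt(0.5), roughly 1.4142, which agrees with the third entry of the second row in the table.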