Feature normalization algorithm in Spark

Asked 2020-12-09 06:09

Trying to understand Spark's normalization algorithm. My small test set contains 5 vectors:

{0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
{1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0},
{-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0},
{-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0},
{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0}
        
1 Answer

  • Answered 2020-12-09 06:56

    Your expectations are simply incorrect. As clearly stated in the official documentation, "Normalizer scales individual samples to have unit L^p norm", where the default value of p is 2. Ignoring numerical precision issues:

    import org.apache.spark.mllib.feature.Normalizer
    import org.apache.spark.mllib.linalg.Vectors

    val normalizer = new Normalizer()  // default p = 2, i.e. unit L2 norm per sample
    
    val rdd = sc.parallelize(Seq(
        Vectors.dense(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),  
        Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0),  
        Vectors.dense(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0),  
        Vectors.dense(-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0),  
        Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0)))
    
    val transformed = normalizer.transform(rdd)
    transformed.map(_.toArray.sum).collect
    // Array[Double] = Array(1.0009051182149054, 1.000085713673417,
    //   0.9999142851020933, 1.00087797536153, 1.0)
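
    These sums only come out close to 1 because the 70000.0 component dominates every norm; under the default L2 norm it is the sum of squares of each transformed vector that equals 1. A minimal sketch to check this, reusing the rdd and transformed values above (the only new name is l1Normalizer):

    // sum of squares per sample is 1 under the default L2 norm
    transformed.map(_.toArray.map(x => x * x).sum).collect
    // each entry is 1.0 up to numerical precision

    // passing p = 1.0 explicitly gives unit L1 norm per sample instead
    val l1Normalizer = new Normalizer(1.0)
    l1Normalizer.transform(rdd).map(_.toArray.map(math.abs).sum).collect
    // each entry is 1.0 up to numerical precision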
    

    MLlib doesn't provide the functionality you need, but you can use StandardScaler from ML.

    import org.apache.spark.ml.feature.StandardScaler

    // toDF needs the SQL implicits in scope (imported automatically in spark-shell)
    val df = rdd.map(Tuple1(_)).toDF("features")

    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithStd(true)
      .setWithMean(true)

    val transformedDF = scaler.fit(df).transform(df)

    transformedDF.select($"scaledFeatures").show(5, false)
    
    // +--------------------------------------------------------------------------------------------------------------------------+
    // |scaledFeatures                                                                                                            |
    // +--------------------------------------------------------------------------------------------------------------------------+
    // |[0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0]                |
    // |[1.0253040317020319,1.4038947727833362,1.414213562373095,-0.6532797101459693,-0.6532797101459693,-0.6010982697825494,0.0] |
    // |[-1.0253040317020319,-1.4242574689236265,-1.414213562373095,-0.805205224133404,-0.805205224133404,-0.8536605680105113,0.0]|
    // |[-0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0]               |
    // |[0.0,-0.010181348070145075,0.0,-0.7292424671396867,-0.7292424671396867,-0.7273794188965303,0.0]                           |
    // +--------------------------------------------------------------------------------------------------------------------------+
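
    If you need the scaled features back as an RDD of vectors rather than a DataFrame column, a minimal sketch (use the vector class that matches your Spark version: with the mllib vectors used above it is org.apache.spark.mllib.linalg.Vector, on Spark 2.x+ it would be org.apache.spark.ml.linalg.Vector):

    import org.apache.spark.mllib.linalg.Vector

    val scaledRdd = transformedDF
      .select("scaledFeatures")
      .rdd                                     // DataFrame -> RDD[Row]
      .map(_.getAs[Vector]("scaledFeatures"))  // extract the vector column

    scaledRdd.take(2).foreach(println)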
    