Anomaly detection with PCA in Spark

Question

I read the following article:

Anomaly detection with Principal Component Analysis (PCA)

The article states the following:

• PCA algorithm basically transforms data readings from an existing coordinate system into a new coordinate system.

• The closer data readings are to the center of the new coordinate system, the closer these readings are to an optimum value.

• The anomaly score is calculated using the Mahalanobis distance between a reading and the mean of all readings, which is the center of the transformed coordinate system.

Can anyone describe in more detail how anomaly detection with PCA works (using PCA scores and the Mahalanobis distance)? I'm confused because the definition of PCA is: "PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables". How can the Mahalanobis distance be used when there is no longer any correlation between the variables?

Can anybody explain how to do this in Spark? Does the pca.transform function return the scores, for which I should then calculate the Mahalanobis distance of every reading to the center?

Answer 1:

Let's assume you have a dataset of 3-dimensional points. Each point has coordinates (x, y, z); these are its dimensions. A point is represented by three values, e.g. (8, 7, 4). This is called the input vector.

When you apply the PCA algorithm, you basically transform your input vector into a new vector. This can be represented as a function that maps (x, y, z) => (v, w).

Example: (8, 7, 4) => (-4, 13)
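To make the mapping concrete, here is a minimal sketch of a projection as two dot products. The two direction vectors are made up: they were picked only so that the arithmetic reproduces the example above. Real PCA directions are unit-length eigenvectors of the covariance matrix, learned from the data.

```scala
object ProjectionSketch {
  // Hypothetical projection directions (NOT learned from data, not unit length).
  // Each row produces one coordinate of the new, shorter vector.
  val directions: Array[Array[Double]] = Array(
    Array(-1.0, 0.0, 1.0), // produces v
    Array(0.5, 1.0, 0.5)   // produces w
  )

  // Maps (x, y, z) => (v, w) by taking the dot product with each direction.
  def project(x: Array[Double]): Array[Double] =
    directions.map(d => d.zip(x).map { case (a, b) => a * b }.sum)
}
```

With these assumed directions, ProjectionSketch.project(Array(8.0, 7.0, 4.0)) gives (-4.0, 13.0).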

Now you have a shorter vector (you reduced the number of dimensions), but your point still has coordinates, namely (v, w). This means you can still compute the distance between two points using the Mahalanobis measure. Points that lie far from the mean coordinate are treated as anomalies.
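On the question of why the Mahalanobis distance still makes sense for uncorrelated components: when the components are uncorrelated, the covariance matrix is diagonal, so the distance reduces to a Euclidean distance in which each component is divided by its own standard deviation. The following is a rough sketch of that special case in plain Scala, without Spark. The scores are made-up values, not the output of a real PCA fit, and the object and method names are only illustrative.

```scala
object MahalanobisSketch {
  private def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Sample variance (divides by n - 1).
  private def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
  }

  // Mahalanobis distance of every score to the mean, ASSUMING the components
  // are uncorrelated (diagonal covariance), as PCA scores are.
  def distances(scores: Seq[Array[Double]]): Seq[Double] = {
    val k = scores.head.length
    val means = (0 until k).map(i => mean(scores.map(_(i))))
    val vars = (0 until k).map(i => variance(scores.map(_(i))))
    scores.map { s =>
      math.sqrt((0 until k).map { i =>
        val d = s(i) - means(i)
        d * d / vars(i)
      }.sum)
    }
  }

  // Hypothetical 2-D scores; the last one lies far out along the first component.
  val exampleScores: Seq[Array[Double]] = Seq(
    Array(-1.0, 0.5),
    Array(-1.0, -0.5),
    Array(-0.5, 0.5),
    Array(-0.5, -0.5),
    Array(3.0, 0.0)
  )
}
```

For these assumed scores, the last point gets the largest distance, so it would be the first candidate for an anomaly.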

Example solution:

import breeze.linalg.{DenseVector, inv}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._

object SparkApp extends App {
  val session = SparkSession.builder()
    .appName("spark-app").master("local[*]").getOrCreate()
  session.sparkContext.setLogLevel("ERROR")
  import session.implicits._

  val df = Seq(
    (1, 4, 0),
    (3, 4, 0),
    (1, 3, 0),
    (3, 3, 0),
    (67, 37, 0) // outlier
  ).toDF("x", "y", "z")

  // Pack the three columns into a single vector column.
  val vectorAssembler = new VectorAssembler().setInputCols(Array("x", "y", "z")).setOutputCol("vector")
  // Center and scale each feature, so the PCA scores end up centered at the origin.
  val standardScaler = new StandardScaler().setInputCol("vector").setOutputCol("normalized-vector").setWithMean(true)
    .setWithStd(true)

  // Project the normalized 3-dimensional vectors onto the first 2 principal components.
  val pca = new PCA().setInputCol("normalized-vector").setOutputCol("pca-features").setK(2)

  val pipeline = new Pipeline().setStages(
    Array(vectorAssembler, standardScaler, pca)
  )

  val pcaDF = pipeline.fit(df).transform(df)

  def withMahalanobis(df: DataFrame, inputCol: String): DataFrame = {
    // Covariance matrix of the PCA scores (the Mahalanobis distance needs the
    // covariance, not the correlation matrix).
    val rows = df.select(df(inputCol)).rdd.map { case Row(v: Vector) => OldVectors.fromML(v) }
    val cov = new RowMatrix(rows).computeCovariance()

    val invCovariance = inv(new breeze.linalg.DenseMatrix(cov.numRows, cov.numCols, cov.toArray))

    // The scores are already centered, so v itself is the offset from the mean.
    val mahalanobis = udf[Double, Vector] { v =>
      val vB = DenseVector(v.toArray)
      math.sqrt(vB.t * invCovariance * vB)
    }

    df.withColumn("mahalanobis", mahalanobis(df(inputCol)))
  }

  // Named differently from the method above to avoid a name clash.
  val result: DataFrame = withMahalanobis(pcaDF, "pca-features")
  result.show(false)

  session.close()
}

Source: https://stackoverflow.com/questions/49530351/anomaly-detection-with-pca-in-spark

Tags

apache-spark

apache-spark-sql

apache-spark-mllib

pca

anomaly-detection