Spark Scala - How to group dataframe rows and apply complex function to the groups?


Cosine similarity is not a complex function and can be expressed using standard SQL aggregations. Let's consider the following example:

// assumes spark.implicits._ is in scope (it is by default in spark-shell)
val df = Seq(
  ("feat1", 1.0, "item1"),
  ("feat2", 1.0, "item1"),
  ("feat6", 1.0, "item1"),
  ("feat1", 1.0, "item2"),
  ("feat3", 1.0, "item2"),
  ("feat4", 1.0, "item3"),
  ("feat5", 1.0, "item3"),
  ("feat1", 1.0, "item4"),
  ("feat6", 1.0, "item4")
).toDF("feature", "value", "item")

where feature is a feature identifier, value is the corresponding value, item is an object identifier, and each (feature, item) pair has only one corresponding value.

Cosine similarity is defined as:

cosine(A, B) = (A · B) / (||A|| * ||B||)

where the numerator (the dot product) can be computed as:

import org.apache.spark.sql.functions.sum

val numer = df.as("this").withColumnRenamed("item", "this")
  .join(df.as("other").withColumnRenamed("item", "other"), Seq("feature"))
  .where($"this" < $"other")
  .groupBy($"this", $"other")
  .agg(sum($"this.value" * $"other.value").alias("dot"))
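
For the data above, numer contains one row per item pair that shares at least one feature. Shown here for reference only (my addition, not part of the original answer; row order may vary):

+-----+-----+---+
| this|other|dot|
+-----+-----+---+
|item1|item2|1.0|
|item1|item4|2.0|
|item2|item4|1.0|
+-----+-----+---+

item3 shares no features with any other item, so it does not appear.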

and norms used in the denominator as:

import org.apache.spark.sql.functions.sqrt

val norms = df.groupBy($"item").agg(sqrt(sum($"value" * $"value")).alias("norm"))
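
Since every value in this toy data is 1.0, each norm is simply the square root of the item's feature count. For reference (my addition; row order may vary), norms.show() gives:

+-----+------------------+
| item|              norm|
+-----+------------------+
|item1|1.7320508075688772|
|item2|1.4142135623730951|
|item3|1.4142135623730951|
|item4|1.4142135623730951|
+-----+------------------+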

Combined together:

val cosine = ($"dot" / ($"this_norm.norm" * $"other_norm.norm")).as("cosine") 

val similarities = numer
  .join(norms.alias("this_norm").withColumnRenamed("item", "this"), Seq("this"))
  .join(norms.alias("other_norm").withColumnRenamed("item", "other"), Seq("other"))
  .select($"this", $"other", cosine)

with the result representing the non-zero entries of the upper triangular part of the similarity matrix, ignoring the diagonal (which is trivially 1.0):

+-----+-----+-------------------+
| this|other|             cosine|
+-----+-----+-------------------+
|item1|item4| 0.8164965809277259|
|item1|item2|0.40824829046386296|
|item2|item4| 0.4999999999999999|
+-----+-----+-------------------+
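
As a quick sanity check (my addition, not part of the original answer), the item1/item2 entry can be reproduced by hand: the two items share only feat1, so their dot product is 1.0, while their norms are sqrt(3) and sqrt(2):

val dot   = 1.0             // only feat1 is shared
val norm1 = math.sqrt(3.0)  // item1: feat1, feat2, feat6
val norm2 = math.sqrt(2.0)  // item2: feat1, feat3
dot / (norm1 * norm2)       // ≈ 0.4082, matching the item1/item2 row above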

This should be equivalent to:

import org.apache.spark.sql.functions.{array, col}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.linalg.Vectors

val pivoted = df.groupBy("item").pivot("feature").sum()
  .na.fill(0.0)
  .orderBy("item")
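
For reference (my addition, not part of the original answer; exact column naming may vary slightly across Spark versions), pivoted is the item-by-feature matrix:

+-----+-----+-----+-----+-----+-----+-----+
| item|feat1|feat2|feat3|feat4|feat5|feat6|
+-----+-----+-----+-----+-----+-----+-----+
|item1|  1.0|  1.0|  0.0|  0.0|  0.0|  1.0|
|item2|  1.0|  0.0|  1.0|  0.0|  0.0|  0.0|
|item3|  0.0|  0.0|  0.0|  1.0|  1.0|  0.0|
|item4|  1.0|  0.0|  0.0|  0.0|  0.0|  1.0|
+-----+-----+-----+-----+-----+-----+-----+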

val mat = new IndexedRowMatrix(pivoted
  .select(array(pivoted.columns.tail.map(col): _*))
  .rdd
  .zipWithIndex
  .map {
    case (row, idx) => 
      new IndexedRow(idx, Vectors.dense(row.getSeq[Double](0).toArray))
  })

// mat has items as rows and features as columns; transposing it lets
// columnSimilarities compute pairwise cosine similarities between items:
mat.toCoordinateMatrix.transpose
  .toIndexedRowMatrix.columnSimilarities
  .toBlockMatrix.toLocalMatrix

0.0  0.408248290463863  0.0  0.816496580927726
0.0  0.0                0.0  0.4999999999999999
0.0  0.0                0.0  0.0
0.0  0.0                0.0  0.0
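
The columnSimilarities result is indexed by position rather than by item name. Below is a minimal sketch (my addition, not part of the original answer, assuming spark.implicits._ is in scope as in spark-shell) that maps the matrix indices back to item identifiers, relying on the orderBy("item") above so that column i corresponds to the i-th item:

val items = pivoted.select("item").as[String].collect()

mat.toCoordinateMatrix.transpose
  .toIndexedRowMatrix.columnSimilarities
  .entries  // RDD[MatrixEntry]; only the upper triangle (i < j) is populated
  .map(e => (items(e.i.toInt), items(e.j.toInt), e.value))
  .toDF("this", "other", "cosine")
  .show()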

Regarding your code:

  • Execution is sequential because your code operates on a local (collected) collection.
  • myComplexFunction cannot be further distributed because it uses distributed data structures and contexts, which cannot be used from inside other distributed operations.