Calculate Cosine Similarity Spark Dataframe

Anonymous (unverified), submitted 2019-12-03 02:28:01

Question:

I am using Spark with Scala to calculate the cosine similarity between the rows of a DataFrame.

The DataFrame schema is:

    root
     |-- SKU: double (nullable = true)
     |-- Features: vector (nullable = true)

A sample of the DataFrame:

    +-------+--------------------+
    |    SKU|            Features|
    +-------+--------------------+
    | 9970.0|[4.7143,0.0,5.785...|
    |19676.0|[5.5,0.0,6.4286,4...|
    | 3296.0|[4.7143,1.4286,6....|
    |13658.0|[6.2857,0.7143,4....|
    |    1.0|[4.2308,0.7692,5....|
    |  513.0|[3.0,0.0,4.9091,5...|
    | 3753.0|[5.9231,0.0,4.846...|
    |14967.0|[4.5833,0.8333,5....|
    | 2803.0|[4.2308,0.0,4.846...|
    |11879.0|[3.1429,0.0,4.5,4...|
    +-------+--------------------+

I tried to transpose the matrix and checked the following questions: "Apache Spark Python Cosine Similarity over DataFrames" and "calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf". But I believe there is a better solution.

I tried the sample code below:

    val irm = new IndexedRowMatrix(inClusters.rdd.map {
      case (v,i:Vector) => IndexedRow(v, i)
    }).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities

But I got the following error:

    Error:(80, 12) constructor cannot be instantiated to expected type;
     found   : (T1, T2)
     required: org.apache.spark.sql.Row
          case (v,i:Vector) => IndexedRow(v, i)

I also checked the question "Apache Spark: How to create a matrix from a DataFrame?", but I could not get it to work in Scala.

Answer 1:

  • DataFrame.rdd returns RDD[Row], not RDD[(T, U)]. You have to pattern match on the Row or extract the parts you need directly.
  • The ml Vector used with Datasets since Spark 2.0 is not the same as the mllib Vector used by the old API. You have to convert it before using it with IndexedRowMatrix.
  • The index has to be a Long, not a String.
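    import org.apache.spark.sql.Row
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    val irm = new IndexedRowMatrix(inClusters.rdd.map {
      // Match on Row, keep only the ml Vector, and convert it to an mllib Vector
      case Row(_, v: org.apache.spark.ml.linalg.Vector) =>
        org.apache.spark.mllib.linalg.Vectors.fromML(v)
    }.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })  // zipWithIndex supplies the Long index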
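To go from the IndexedRowMatrix to the actual row-to-row similarities, one option is to continue with the chain the question attempted: transpose so that rows become columns, then call columnSimilarities. The following is a minimal sketch, not a definitive implementation. The object name CosineSimilaritySketch, the helper pairwiseCosine and the three-row sample data are made up for illustration, and the code assumes the (SKU: double, Features: vector) schema from the question. It also keeps a map from the generated index back to the SKU, which the snippet above does not do.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object CosineSimilaritySketch {

  // Hypothetical helper: returns (skuA, skuB, cosine) triples.
  // Assumes df has exactly the columns (SKU: double, Features: ml Vector).
  def pairwiseCosine(df: DataFrame): Array[(Double, Double, Double)] = {
    // Convert each ml Vector to an mllib Vector and attach a Long index.
    val indexed = df.rdd.map {
      case Row(sku: Double, v: MLVector) => (sku, OldVectors.fromML(v))
    }.zipWithIndex.cache()

    // Remember which SKU received which index (collected to the driver,
    // so this part is only reasonable for a modest number of rows).
    val skuByIndex = indexed.map { case ((sku, _), i) => (i, sku) }.collectAsMap()

    val irm = new IndexedRowMatrix(indexed.map { case ((_, v), i) => IndexedRow(i, v) })

    // columnSimilarities compares columns, so transpose first:
    // each original row becomes a column of the transposed matrix.
    val sims = irm.toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()

    // Entries cover the upper triangle only (i < j); map indices back to SKUs.
    sims.entries.collect().map(e => (skuByIndex(e.i), skuByIndex(e.j), e.value))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cosine-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up stand-in for inClusters; the first two vectors are parallel.
    val inClusters = Seq(
      (1.0, MLVectors.dense(1.0, 0.0, 1.0)),
      (2.0, MLVectors.dense(2.0, 0.0, 2.0)),
      (3.0, MLVectors.dense(0.0, 3.0, 0.0))
    ).toDF("SKU", "Features")

    pairwiseCosine(inClusters).foreach(println)
    spark.stop()
  }
}
```

With this sample data, the pair of SKUs 1.0 and 2.0 should come out with a similarity close to 1.0, since their feature vectors point in the same direction. After the transpose the matrix has one column per original row, so this approach may not scale well to DataFrames with a very large number of rows.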

