I am using Spark with Scala to calculate the cosine similarity between DataFrame rows.
The DataFrame schema is below:
root
|-- SKU: double (nullable = true)
|-- Features: vector (nullable = true)
A sample of the DataFrame is below:
+-------+--------------------+
|    SKU|            Features|
+-------+--------------------+
| 9970.0|[4.7143,0.0,5.785...|
|19676.0|[5.5,0.0,6.4286,4...|
| 3296.0|[4.7143,1.4286,6....|
|13658.0|[6.2857,0.7143,4....|
|    1.0|[4.2308,0.7692,5....|
|  513.0|[3.0,0.0,4.9091,5...|
| 3753.0|[5.9231,0.0,4.846...|
|14967.0|[4.5833,0.8333,5....|
| 2803.0|[4.2308,0.0,4.846...|
|11879.0|[3.1429,0.0,4.5,4...|
+-------+--------------------+
I tried to transpose the matrix and checked the following links: "Apache Spark Python Cosine Similarity over DataFrames" and "calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf". But I believe there is a better solution.
I tried the sample code below:
val irm = new IndexedRowMatrix(inClusters.rdd.map {
case (v,i:Vector) => IndexedRow(v, i)
}).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities
But I got the error below:
Error:(80, 12) constructor cannot be instantiated to expected type;
found : (T1, T2)
required: org.apache.spark.sql.Row
case (v,i:Vector) => IndexedRow(v, i)
I checked the following link, "Apache Spark: How to create a matrix from a DataFrame?", but I can't get it to work using Scala.
- DataFrame.rdd returns RDD[Row], not RDD[(T, U)]. You have to pattern match the Row or directly extract the interesting parts.
- The ml Vector used with Datasets since Spark 2.0 is not the same as the mllib Vector used by the old API. You have to convert it to use it with IndexedRowMatrix.
- The index has to be a Long, not a string.
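import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  // Match each Row and convert the ml Vector to the mllib Vector the old API expects
  case Row(_, v: org.apache.spark.ml.linalg.Vector) =>
    org.apache.spark.mllib.linalg.Vectors.fromML(v)
}.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })  // the row index is a Long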
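To get from this matrix to row-to-row similarities, one possible continuation is sketched below. It follows the transpose-then-columnSimilarities chain from the question and maps the generated row indices back to SKUs. The object name SkuCosineSketch, the helper skuSimilarities, and the toy vectors in main are made up for illustration; the sketch also assumes SKU is never null and that the SKU column comes before Features.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object SkuCosineSketch {

  // Hypothetical helper: returns (skuA, skuB, cosine) triples for the given DataFrame.
  // Assumes columns SKU (non-null double) and Features (ml Vector).
  def skuSimilarities(inClusters: DataFrame): RDD[(Double, Double, Double)] = {
    // Pattern match each Row, convert ml Vector -> mllib Vector, attach a Long index.
    // Cached so that both uses below see the same index assignment.
    val indexed = inClusters.select("SKU", "Features").rdd.map {
      case Row(sku: Double, v: MLVector) => (sku, OldVectors.fromML(v))
    }.zipWithIndex().cache()

    val irm = new IndexedRowMatrix(indexed.map { case ((_, v), i) => IndexedRow(i, v) })

    // columnSimilarities compares columns, so transpose first:
    // each original row (one SKU) becomes a column.
    val sims = irm.toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()

    // Lookup from generated index back to SKU.
    val skuByIndex = indexed.map { case ((sku, _), i) => (i, sku) }

    // Replace both indices of every similarity entry with the matching SKU.
    sims.entries
      .map { case MatrixEntry(i, j, sim) => (i, (j, sim)) }
      .join(skuByIndex)
      .map { case (_, ((j, sim), skuI)) => (j, (skuI, sim)) }
      .join(skuByIndex)
      .map { case (_, ((skuI, sim), skuJ)) => (skuI, skuJ, sim) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sku-cosine-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for inClusters: made-up values, same schema as in the question.
    val inClusters = Seq(
      (1.0, MLVectors.dense(1.0, 0.0)),
      (2.0, MLVectors.dense(2.0, 0.0)),
      (3.0, MLVectors.dense(0.0, 1.0))
    ).toDF("SKU", "Features")

    skuSimilarities(inClusters).collect().foreach(println)
    spark.stop()
  }
}
```

columnSimilarities returns only the upper triangle of the similarity matrix, so each pair of SKUs appears at most once and no SKU is paired with itself. For a large number of rows the result has up to n*(n-1)/2 entries, so it may be worth filtering by a minimum similarity before collecting anything to the driver.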
Source: https://stackoverflow.com/questions/47010126/calculate-cosine-similarity-spark-dataframe