Convert an org.apache.spark.mllib.linalg.Vector RDD to a DataFrame in Spark using Scala

Submitted by 我的梦境 on 2019-12-06 04:32:44

Question


I have an org.apache.spark.mllib.linalg.Vector RDD whose elements have the form [Int Int Int]. I am trying to convert it into a DataFrame using this code:

import sqlContext.implicits._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.ArrayData

vectrdd is of type RDD[org.apache.spark.mllib.linalg.Vector]:

val vectarr = vectrdd.toArray()
case class RFM(Recency: Integer, Frequency: Integer, Monetary: Integer)
val df = vectarr.map { case Array(p0, p1, p2) => RFM(p0, p1, p2) }.toDF()

I am getting the following error:

warning: fruitless type test: a value of type         
org.apache.spark.mllib.linalg.Vector cannot also be a Array[T]
val df = vectarr.map { case Array(p0, p1, p2) => RFM(p0, p1, p2) }.toDF()

error: pattern type is incompatible with expected type;
found   : Array[T]
required: org.apache.spark.mllib.linalg.Vector
val df = vectarr.map { case Array(p0, p1, p2) => RFM(p0, p1, p2) }.toDF()

The second method I tried is this:

val vectarr=vectrdd.toArray().take(2)
case class RFM(Recency: String, Frequency: String, Monetary: String)
val df = vectrdd.map { case (t0, t1, t2) => RFM(p0, p1, p2) }.toDF()

I got this error:

error: constructor cannot be instantiated to expected type;
found   : (T1, T2, T3)
required: org.apache.spark.mllib.linalg.Vector
val df = vectrdd.map { case (t0, t1, t2) => RFM(p0, p1, p2) }.toDF()

I used this example as a guide: Convert RDD to Dataframe in Spark/Scala


Answer 1:


vectarr will have the type Array[org.apache.spark.mllib.linalg.Vector], so in the pattern match you cannot use Array(p0, p1, p2), because each element being matched is a Vector, not an Array.

Also, you should not do val vectarr = vectrdd.toArray() - this collects the RDD into a local Array, and the subsequent toDF call will then fail, since toDF is not available on an Array.

The correct line would be (provided you change the fields of RFM to Double):

val df = vectrdd.map(_.toArray).map { case Array(p0, p1, p2) => RFM(p0, p1, p2)}.toDF()

or, equivalently, replace val vectarr = vectrdd.toArray() (which produces a local Array[Vector]) with val arrayRDD = vectrdd.map(_.toArray) (which produces an RDD[Array[Double]]).
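Putting the answer together, here is a minimal self-contained sketch of the whole conversion. It assumes a SparkSession named spark is already available; the sample vectors and the variable names vectrdd and df are illustrative, not from the original question:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Vector stores Doubles, so the case class fields must be Double, not Integer.
case class RFM(Recency: Double, Frequency: Double, Monetary: Double)

// Example input: each Vector holds [recency, frequency, monetary].
val vectrdd: RDD[Vector] = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)
))

import spark.implicits._

// Convert each Vector to Array[Double] first, then pattern match on the Array.
val df = vectrdd
  .map(_.toArray)
  .map { case Array(r, f, m) => RFM(r, f, m) }
  .toDF()

df.show()
```

The key point is the intermediate map(_.toArray): pattern matching with Array(...) only works once each element actually is an Array, and keeping everything as an RDD (rather than collecting to a local Array) is what makes toDF available at the end.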



Source: https://stackoverflow.com/questions/34688258/convert-an-org-apache-spark-mllib-linalg-vector-rdd-to-a-dataframe-in-spark-usin
