Convert RDD to Dataframe in Spark/Scala

Anonymous (unverified), submitted 2019-12-03 02:45:02

Question:

The RDD has been created in the format Array[Array[String]] and has the following values:

 Array[Array[String]] = Array(
   Array(4580056797, 0, 2015-07-29 10:38:42, 0, 1, 1),
   Array(4580056797, 0, 2015-07-29 10:38:42, 0, 1, 1),
   Array(4580056797, 0, 2015-07-29 10:38:42, 0, 1, 1),
   Array(4580057445, 0, 2015-07-29 10:40:37, 0, 1, 1),
   Array(4580057445, 0, 2015-07-29 10:40:37, 0, 1, 1))

I want to create a dataFrame with the schema :

val schemaString = "callId oCallId callTime duration calltype swId" 

Next steps:

scala> val rowRDD = rdd.map(p => Array(p(0), p(1), p(2), p(3), p(4), p(5).trim))
rowRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[14] at map at <console>:39

scala> val calDF = sqlContext.createDataFrame(rowRDD, schema)

Gives the following error:

<console>:45: error: overloaded method value createDataFrame with alternatives:
  (rdd: org.apache.spark.api.java.JavaRDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame
  (rdd: org.apache.spark.rdd.RDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame
  (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
  (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[Array[String]], org.apache.spark.sql.types.StructType)
  val calDF = sqlContext.createDataFrame(rowRDD, schema)

Answer 1:

Just paste the following into a spark-shell:

val a = Array(
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"))

val rdd = sc.makeRDD(a)

case class X(callId: String, oCallId: String,
  callTime: String, duration: String, calltype: String, swId: String)

Then map() over the RDD to create instances of the case class, and build the DataFrame with toDF():

scala> val df = rdd.map {
  case Array(s0, s1, s2, s3, s4, s5) => X(s0, s1, s2, s3, s4, s5)
}.toDF()
df: org.apache.spark.sql.DataFrame = [callId: string, oCallId: string, callTime: string,
  duration: string, calltype: string, swId: string]

This infers the schema from the case class.
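The extraction step itself is plain Scala and can be checked without Spark; a minimal sketch using one of the sample rows from the question:

```scala
// Case class mirroring the six columns of each input array.
case class X(callId: String, oCallId: String, callTime: String,
             duration: String, calltype: String, swId: String)

val raw = Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1")

// Pattern-match the six elements into named fields.
val x = raw match {
  case Array(s0, s1, s2, s3, s4, s5) => X(s0, s1, s2, s3, s4, s5)
}
```

In Spark, toDF() then reads the field names and types off this case class via reflection to infer the schema.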

Then you can proceed with:

scala> df.printSchema()
root
 |-- callId: string (nullable = true)
 |-- oCallId: string (nullable = true)
 |-- callTime: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- calltype: string (nullable = true)
 |-- swId: string (nullable = true)

scala> df.show()
+----------+-------+-------------------+--------+--------+----+
|    callId|oCallId|           callTime|duration|calltype|swId|
+----------+-------+-------------------+--------+--------+----+
|4580056797|      0|2015-07-29 10:38:42|       0|       1|   1|
|4580056797|      0|2015-07-29 10:38:42|       0|       1|   1|
+----------+-------+-------------------+--------+--------+----+

If you want to use toDF() in a normal program (not in the spark-shell), make sure (quoted from here):

  • To import sqlContext.implicits._ right after creating the SQLContext
  • Define the case class outside of the method using toDF()


Answer 2:

You first need to convert your Arrays into Rows, and then define the schema. I assume that most of your fields are Long:

    val rdd: RDD[Array[String]] = ???

    // Match all six columns of each input array and convert the numeric ones.
    val rows: RDD[Row] = rdd map {
      case Array(callId, oCallId, callTime, duration, calltype, swId) =>
        Row(callId.toLong, oCallId.toLong, callTime,
          duration.toLong, calltype.toLong, swId.toLong)
    }

    object schema {
      val callId = StructField("callId", LongType)
      val oCallId = StructField("oCallId", LongType)
      val callTime = StructField("callTime", StringType)
      val duration = StructField("duration", LongType)
      val calltype = StructField("calltype", LongType)
      val swId = StructField("swId", LongType)

      val struct = StructType(Array(callId, oCallId, callTime, duration, calltype, swId))
    }

    sqlContext.createDataFrame(rows, schema.struct)
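The key step above is the per-field conversion from String to the schema's types; that part is plain Scala and can be checked without Spark. A sketch using a tuple as a stand-in for Row (an assumption for illustration only):

```scala
val p = Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1")

// Convert the numeric columns to Long; keep the timestamp as a String.
val converted = (p(0).toLong, p(1).toLong, p(2), p(3).toLong, p(4).toLong, p(5).toLong)
```

If any column can contain non-numeric garbage, `.toLong` will throw, so validate or wrap the conversion as needed.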


Answer 3:

I assume that your schema is built as in the Spark Programming Guide, as follows:

val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
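The split itself is plain Scala and easy to verify without Spark; a quick sketch of the field names it produces from the question's schemaString:

```scala
val schemaString = "callId oCallId callTime duration calltype swId"

// One field name per whitespace-separated token.
val fieldNames = schemaString.split(" ")
```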

If you look at the signatures of createDataFrame, here is the one that accepts a StructType as the 2nd argument (in Scala):

def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame

Creates a DataFrame from an RDD containing Rows using the given schema.

So it accepts an RDD[Row] as its 1st argument. What you have in rowRDD is an RDD[Array[String]], hence the mismatch.

Do you need an RDD[Array[String]] ?

Otherwise you can use the following to create your DataFrame:

val rowRDD = rdd.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5).trim))
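Note that this snippet only trims the last column; if stray whitespace can appear in any column, trimming every element before building the Row is safer. A plain-Scala sketch of that cleanup step (Spark is not needed to check it):

```scala
val p = Array(" 4580056797", "0 ", "2015-07-29 10:38:42", "0", "1", " 1 ")

// Trim every column, not just the last one.
val cleaned = p.map(_.trim)
```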

