Question:
The RDD has been created in the format Array[Array[String]]
and has the following values:
Array[Array[String]] = Array(
  Array(4580056797, 0, 2015-07-29 10:38:42, 0, 1, 1),
  Array(4580056797, 0, 2015-07-29 10:38:42, 0, 1, 1),
  Array(4580056797, 0, 2015-07-29 10:38:42, 0, 1, 1),
  Array(4580057445, 0, 2015-07-29 10:40:37, 0, 1, 1),
  Array(4580057445, 0, 2015-07-29 10:40:37, 0, 1, 1))
I want to create a DataFrame with the schema:
val schemaString = "callId oCallId callTime duration calltype swId"
Next steps:
scala> val rowRDD = rdd.map(p => Array(p(0), p(1), p(2), p(3), p(4), p(5).trim))
rowRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[14] at map at <console>:39
scala> val calDF = sqlContext.createDataFrame(rowRDD, schema)
Gives the following error:
<console>:45: error: overloaded method value createDataFrame with alternatives:
  (rdd: org.apache.spark.api.java.JavaRDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame
  (rdd: org.apache.spark.rdd.RDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame
  (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
  (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[Array[String]], org.apache.spark.sql.types.StructType)
       val calDF = sqlContext.createDataFrame(rowRDD, schema)
Answer 1:
Just paste the following into a spark-shell:
val a = Array(
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"))
val rdd = sc.makeRDD(a)
case class X(callId: String, oCallId: String, callTime: String,
  duration: String, calltype: String, swId: String)
Then map() over the RDD to create instances of the case class, and create the DataFrame using toDF():
scala> val df = rdd.map { case Array(s0, s1, s2, s3, s4, s5) => X(s0, s1, s2, s3, s4, s5) }.toDF()
df: org.apache.spark.sql.DataFrame = [callId: string, oCallId: string, callTime: string, duration: string, calltype: string, swId: string]
This infers the schema from the case class.
Then you can proceed with:
scala> df.printSchema()
root
 |-- callId: string (nullable = true)
 |-- oCallId: string (nullable = true)
 |-- callTime: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- calltype: string (nullable = true)
 |-- swId: string (nullable = true)

scala> df.show()
+----------+-------+-------------------+--------+--------+----+
|    callId|oCallId|           callTime|duration|calltype|swId|
+----------+-------+-------------------+--------+--------+----+
|4580056797|      0|2015-07-29 10:38:42|       0|       1|   1|
|4580056797|      0|2015-07-29 10:38:42|       0|       1|   1|
+----------+-------+-------------------+--------+--------+----+
If you want to use toDF() in a normal program (not in the spark-shell), make sure (quoted from here):
- to import sqlContext.implicits._ right after creating the SQLContext
- to define the case class outside of the method using toDF()
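For reference, here is a minimal sketch of such a standalone program (the object name CallsToDF and the app name are made up here; it assumes Spark 1.x with a plain SQLContext):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// the case class must live outside the method that calls toDF()
case class X(callId: String, oCallId: String, callTime: String,
  duration: String, calltype: String, swId: String)

object CallsToDF {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CallsToDF"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // brings toDF() into scope

    val rdd = sc.makeRDD(Array(
      Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1")))
    val df = rdd.map { case Array(s0, s1, s2, s3, s4, s5) =>
      X(s0, s1, s2, s3, s4, s5)
    }.toDF()
    df.show()
  }
}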
Answer 2:
You first need to convert your Array into a Row, and then define the schema. I assume that most of your fields are Long:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val rdd: RDD[Array[String]] = ???

// the pattern must cover all six fields, including calltype,
// and each value must match the type declared in the schema below
val rows: RDD[Row] = rdd map {
  case Array(callId, oCallId, callTime, duration, calltype, swId) =>
    Row(callId.toLong, oCallId, callTime, duration.toLong, calltype, swId.toLong)
}

object schema {
  val callId = StructField("callId", LongType)
  val oCallId = StructField("oCallId", StringType)
  val callTime = StructField("callTime", StringType)
  val duration = StructField("duration", LongType)
  val calltype = StructField("calltype", StringType)
  val swId = StructField("swId", LongType)

  val struct = StructType(Array(callId, oCallId, callTime, duration, calltype, swId))
}

sqlContext.createDataFrame(rows, schema.struct)
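To try this out in the shell, you could substitute a couple of rows from the question for the ??? placeholder (a sketch; it assumes sc, sqlContext, and the schema object above are in scope):

val rdd = sc.makeRDD(Array(
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
  Array("4580057445", "0", "2015-07-29 10:40:37", "0", "1", "1")))

val rows = rdd map {
  case Array(callId, oCallId, callTime, duration, calltype, swId) =>
    Row(callId.toLong, oCallId, callTime, duration.toLong, calltype, swId.toLong)
}

val df = sqlContext.createDataFrame(rows, schema.struct)
df.printSchema()  // callId, duration and swId now come out as long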
Answer 3:
I assume that your schema, as in the Spark Guide, is as follows:
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
If you look at the signatures of createDataFrame, here is the one that accepts a StructType as the 2nd argument (in Scala):
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
Creates a DataFrame from an RDD containing Rows using the given schema.
So its 1st argument must be an RDD[Row]. What you have in rowRDD is an RDD[Array[String]], so there is a mismatch.
Do you really need an RDD[Array[String]]?
Otherwise you can use the following to create your DataFrame:
val rowRDD = rdd.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5).trim))
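Putting the pieces of this answer together, here is a sketch of the full fix for the snippet in the question (it assumes the original rdd and a sqlContext are in scope):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

val schemaString = "callId oCallId callTime duration calltype swId"
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// map to Row instead of Array so the RDD[Row] overload of createDataFrame applies
val rowRDD = rdd.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5).trim))
val calDF = sqlContext.createDataFrame(rowRDD, schema)
calDF.show()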