Spark RDD: correct date format in Scala?

Submitted by 落花浮王杯 on 2019-12-11 07:36:37

Question


This is the date value I want to use when I convert RDD to Dataframe.

Sun Jul 31 10:21:53 PDT 2016

Declaring the field as DataTypes.DateType in the schema throws an error:

java.util.Date is not a valid external type for schema of date

So I want to prepare the RDD in advance so that the schema above works. How can I correct the date format so the conversion to a DataFrame succeeds?

// Schema for the DataFrame
val schema = StructType(
  StructField("lotStartDate", DateType, false) ::
  StructField("pm", StringType, false) ::
  StructField("wc", LongType, false) ::
  StructField("ri", StringType, false) :: Nil)

// rddRow : [Sun Jul 31 10:21:53 PDT 2016,"PM",11,"ABC"]
val df = spark.createDataFrame(rddRow, schema)
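If the raw field arrives as the string shown above, a minimal sketch of parsing it into a java.sql.Date (the external type DateType accepts) could look like this; the pattern string is an assumption based on the sample value:

```scala
import java.text.SimpleDateFormat
import java.util.Locale

// "Sun Jul 31 10:21:53 PDT 2016" matches this pattern
// (pattern inferred from the sample value above)
val fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US)

val raw = "Sun Jul 31 10:21:53 PDT 2016"
val utilDate: java.util.Date = fmt.parse(raw)

// Spark's DateType accepts java.sql.Date, not java.util.Date
val sqlDate = new java.sql.Date(utilDate.getTime)
```

The Locale.US argument keeps the English day/month abbreviations parseable regardless of the JVM's default locale.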

Answer 1:


Spark's DateType can be encoded from java.sql.Date, so you should convert your input RDD to use that type, e.g.:

val inputRdd: RDD[(Int, java.util.Date)] = ??? // however it's created

// convert java.util.Date to java.sql.Date:
val fixedRdd = inputRdd.map {
  case (id, date) => (id, new java.sql.Date(date.getTime))
}

// now you can convert to DataFrame given your schema:
val schema = StructType(
  StructField("id", IntegerType) :: 
  StructField("date", DateType) :: 
  Nil
)

val df = spark.createDataFrame(
  fixedRdd.map(record => Row.fromSeq(record.productIterator.toSeq)),
  schema
)

// or, even easier - let Spark infer the schema
// (requires: import spark.implicits._)
val df2 = fixedRdd.toDF("id", "date")

// both will evaluate to the same schema, in this case
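Mapped back onto the question's four-column schema, a sketch might look like the following; the tuple element type (java.util.Date, String, Long, String) is an assumption, so adjust it to whatever your RDD actually holds:

```scala
import java.util.Date
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Assumed shape of the original rows (adjust to your actual element type)
val rawRdd: RDD[(Date, String, Long, String)] = ??? // however it's created

// Convert the first field to java.sql.Date before building Rows
val rddRow = rawRdd.map { case (d, pm, wc, ri) =>
  Row(new java.sql.Date(d.getTime), pm, wc, ri)
}

val schema = StructType(
  StructField("lotStartDate", DateType, false) ::
  StructField("pm", StringType, false) ::
  StructField("wc", LongType, false) ::
  StructField("ri", StringType, false) :: Nil)

val df = spark.createDataFrame(rddRow, schema)
```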


Source: https://stackoverflow.com/questions/48469234/spark-rdd-correct-date-format-in-scala
