Spark: Parse Dates / Timestamps with Different Formats (MM-dd-yyyy HH:mm, MM/dd/yy H:mm) in the Same Column of a DataFrame

Question

The problem: I have a dataset in which one column contains two or more date formats. Normally I read all values as String and then use to_date to parse the date, but I don't know how to parse a column that mixes two or more date formats.

val DF= Seq(("02-04-2020 08:02"),("03-04-2020 10:02"),("04-04-2020 09:00"),("04/13/19 9:12"),("04/14/19 2:13"),("04/15/19 10:14"), ("04/16/19 5:15")).toDF("DOB")

import org.apache.spark.sql.functions.{to_date, to_timestamp}
val DOBDF = DF.withColumn("Date", to_date($"DOB", "MM/dd/yyyy"))

Output from the above command:

null
null
null
0019-04-13
0019-04-14
0019-04-15
0019-04-16

The code above does not work. With the pattern MM/dd/yyyy, the slash-separated values are parsed with the wrong year (0019 instead of 2019), and the values in the format I did not provide come back as null.

So I am looking for help parsing a file with different date formats. If possible, please also share a tutorial or notes on dealing with date formats. Please note: I am using Scala with the Spark framework.

Thanks in advance.

Answer 1:

Well, let's do it the try-catch way: try a column conversion against each format and keep the first value that succeeds. You may have to pass all possible formats in as a parameter, or keep a master list of all possible formats somewhere in the code itself.

Here is a possible solution. (Instead of SimpleDateFormat, which sometimes has issues with timestamps beyond millisecond precision, I use the newer java.time.format.DateTimeFormatter.)

Create a toTimestamp function that accepts the string to convert and all possible formats:

  import java.time.LocalDateTime
  import java.time.format.DateTimeFormatterBuilder
  import scala.util.Try

def toTimestamp(date: String, tsformats: Seq[String]): Option[java.sql.Timestamp] = {
    // Try each pattern in order and return the first one that parses; None if none match.
    tsformats.iterator.map { tsft =>
      val formatter = new DateTimeFormatterBuilder()
        .parseCaseInsensitive()
        .appendPattern(tsft)
        .toFormatter()
      // Try turns a parse failure (or a null input) into None instead of an exception
      Try(java.sql.Timestamp.valueOf(LocalDateTime.parse(date, formatter))).toOption
    }.collectFirst { case Some(ts) => ts }
  }
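Since the first format that parses wins, the order of the format list matters whenever a value is valid under more than one pattern. The following is a minimal standalone sketch of that effect using only java.time, with no Spark involved. The names `FirstMatchDemo` and `parseFirst` are hypothetical and exist only for this illustration, and the `dd-MM-yyyy HH:mm` pattern is an extra format that is not part of the question's data.

```scala
import java.sql.Timestamp
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object FirstMatchDemo {
  // Hypothetical helper: try the patterns in order and keep the first success.
  def parseFirst(s: String, patterns: Seq[String]): Option[Timestamp] =
    patterns.iterator
      .map(p => Try(Timestamp.valueOf(LocalDateTime.parse(s, DateTimeFormatter.ofPattern(p)))).toOption)
      .collectFirst { case Some(ts) => ts }

  def main(args: Array[String]): Unit = {
    val monthFirst = Seq("MM-dd-yyyy HH:mm", "dd-MM-yyyy HH:mm")
    val dayFirst   = monthFirst.reverse

    // "02-04-2020" is valid under both patterns, so the list order decides the result.
    println(parseFirst("02-04-2020 08:02", monthFirst)) // read as 4 February 2020
    println(parseFirst("02-04-2020 08:02", dayFirst))   // read as 2 April 2020

    // A value that matches no pattern yields None rather than an exception.
    println(parseFirst("not a date", monthFirst))
  }
}
```

For the two formats in the question this ambiguity does not arise, because the separators differ (`-` versus `/`), so each value can match at most one pattern.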

Create a UDF on top of it (this UDF takes a Seq of format strings as a parameter):

 def UtoTimestamp(tsformats: Seq[String]) = org.apache.spark.sql.functions.udf((date: String) => toTimestamp(date, tsformats))

And now simply use it in your Spark code. Here's a test with your data:

    val DF = Seq(("02-04-2020 08:02"), ("03-04-2020 10:02"), ("04-04-2020 09:00"), ("04/13/19 9:12"), ("04/14/19 2:13"), ("04/15/19 10:14"), ("04/16/19 5:15")).toDF("DOB")

    val tsformats = Seq("MM-dd-yyyy HH:mm", "MM/dd/yy H:mm")

    DF.select(UtoTimestamp(tsformats)('DOB)).show

And here is the output -

+-------------------+
|           UDF(DOB)|
+-------------------+
|2020-02-04 08:02:00|
|2020-03-04 10:02:00|
|2020-04-04 09:00:00|
|2019-04-13 09:12:00|
|2019-04-14 02:13:00|
|2019-04-15 10:14:00|
|2019-04-16 05:15:00|
+-------------------+

The cherry on top would be to avoid writing UtoTimestamp(colname) for every timestamp column in your DataFrame. Let's write a function that accepts a DataFrame, the list of all timestamp columns, and all possible formats in which your source data may have encoded timestamps.

It parses all the timestamp columns for you, trying each format in turn.

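import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def WithTimestampParsed(df: DataFrame, tsCols: Seq[String], tsformats: Seq[String]): DataFrame = {

    // Replace each listed timestamp column with its parsed version; pass the others through unchanged
    val colSelector = df.columns.map { c =>
      if (tsCols.contains(c)) UtoTimestamp(tsformats)(col(c)).alias(c)
      else col(c)
    }
    df.select(colSelector: _*)
}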

Use it like this -

// You can pass as many column names in a sequence to be parsed
WithTimestampParsed(DF, Seq("DOB"), tsformats).show

Output -

+-------------------+
|                DOB|
+-------------------+
|2020-02-04 08:02:00|
|2020-03-04 10:02:00|
|2020-04-04 09:00:00|
|2019-04-13 09:12:00|
|2019-04-14 02:13:00|
|2019-04-15 10:14:00|
|2019-04-16 05:15:00|
+-------------------+
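For comparison, the same try-each-format idea can be sketched with Spark's built-in functions instead of a UDF: call `to_timestamp` once per pattern and let `coalesce` keep the first non-null result. This is a sketch under assumptions, not something verified against a particular Spark version. It relies on `to_timestamp` returning null for values that do not match the pattern, which is not the case when ANSI mode (`spark.sql.ansi.enabled`) is on, where a parse failure raises an error instead. The names `BuiltinTimestampParsing` and `parseWithFormats` are hypothetical.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, to_timestamp}

object BuiltinTimestampParsing {
  // Hypothetical helper: one to_timestamp call per pattern, in list order.
  // coalesce returns the first non-null value, i.e. the first pattern that parsed.
  // Assumption: formats is non-empty, since coalesce needs at least one argument.
  def parseWithFormats(c: Column, formats: Seq[String]): Column =
    coalesce(formats.map(f => to_timestamp(c, f)): _*)
}
```

It could be used as `DF.withColumn("DOB", BuiltinTimestampParsing.parseWithFormats(col("DOB"), tsformats))`. Staying with built-in expressions keeps the parsing inside Spark's optimizer, while the UDF version gives full control over the parsing rules.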

Answer 2:

Here is some code that may help you in some way. I tried this:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import java.sql.Date
import java.util.{GregorianCalendar}


object DateFormats {

  val spark = SparkSession
    .builder()
    .appName("Multiline")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") //Change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "Multiline")  // To silence Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext

  def main(args: Array[String]): Unit = {

    Logger.getRootLogger.setLevel(Level.ERROR)


    try {

      import spark.implicits._

      val DF = Seq(("02-04-2020 08:02"),("03-04-2020 10:02"),("04-04-2020 09:00"),("04/13/19 9:12"),("04/14/19 2:13"),("04/15/19 10:14"), ("04/16/19 5:15")).toDF("DOB")

      import org.apache.spark.sql.functions.{to_date, to_timestamp}
      val DOBDF = DF.withColumn("Date", to_date($"DOB", "MM/dd/yyyy"))

      DOBDF.show()

      // todo: my code below
      DF
        .rdd
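        .map(r =>{
            // r.toString yields e.g. "[02-04-2020 08:02]", so index 0 is the opening bracket
            if(r.toString.contains("-")) {
              // Dash-separated values: read here as dd-MM-yyyy (day first)
              val dat = r.toString.substring(1,11).split("-")
              val calendar = new GregorianCalendar(dat(2).toInt,dat(1).toInt - 1,dat(0).toInt)
              (r.toString, new Date(calendar.getTimeInMillis))
            } else {
              // Slash-separated values: MM/dd/yy, assuming all years are in the 2000s
              val dat = r.toString.substring(1,9).split("/")
              val calendar = new GregorianCalendar(dat(2).toInt + 2000,dat(0).toInt - 1,dat(1).toInt)
              (r.toString, new Date(calendar.getTimeInMillis))
            }

        })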
        .toDF("DOB","DATE")
        .show()

      // To have the opportunity to view the web console of Spark: http://localhost:4040/
      println("Type whatever to the console to exit......")
      scala.io.StdIn.readLine()
    } finally {
      sc.stop()
      println("SparkContext stopped.")
      spark.stop()
      println("SparkSession stopped.")
    }
  }
}

+------------------+----------+
|               DOB|      DATE|
+------------------+----------+
|[02-04-2020 08:02]|2020-04-02|
|[03-04-2020 10:02]|2020-04-03|
|[04-04-2020 09:00]|2020-04-04|
|   [04/13/19 9:12]|2019-04-13|
|   [04/14/19 2:13]|2019-04-14|
|  [04/15/19 10:14]|2019-04-15|
|   [04/16/19 5:15]|2019-04-16|
+------------------+----------+
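The separator-based branching above can also be sketched with java.time instead of manual substring arithmetic and GregorianCalendar. This is a minimal standalone illustration, not the answer's code: `SeparatorDateDemo` and `parseDate` are hypothetical names, and the sketch assumes the dash-separated values are MM-dd-yyyy, as stated in the question, whereas the output above reads them day first.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

object SeparatorDateDemo {
  // Assumption: dash-separated values are MM-dd-yyyy, slash-separated values are MM/dd/yy.
  private val dashFormat  = DateTimeFormatter.ofPattern("MM-dd-yyyy")
  private val slashFormat = DateTimeFormatter.ofPattern("MM/dd/yy")

  // Hypothetical helper: keep only the date part (everything before the first space),
  // pick the formatter by separator, and return None for null or unparsable input.
  def parseDate(raw: String): Option[LocalDate] = Try {
    val datePart  = raw.trim.takeWhile(_ != ' ')
    val formatter = if (datePart.contains("-")) dashFormat else slashFormat
    LocalDate.parse(datePart, formatter)
  }.toOption

  def main(args: Array[String]): Unit = {
    Seq("02-04-2020 08:02", "04/13/19 9:12", "garbage")
      .foreach(s => println(s"$s -> ${parseDate(s)}"))
  }
}
```

Working on the raw string value rather than on `r.toString` avoids having to skip the surrounding brackets, and the formatter rejects invalid dates instead of silently rolling them over as a lenient calendar would.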

Regards

Source: https://stackoverflow.com/questions/60986710/spark-parse-a-date-timestamps-with-different-formats-mm-dd-yyyy-hhmm-mm-d

Tags

scala

date

apache-spark

datetime

apache-spark-sql