Better way to convert a string field into timestamp in Spark

独厮守ぢ 2020-11-27 16:29

I have a CSV in which a field is a datetime in a specific format. I cannot import it directly into my DataFrame because it needs to be a timestamp. So I import it as a string and convert it into a timestamp afterwards. Is there a better, more concise way to do this?
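For context, here is a minimal sketch of the setup (the file name, the column name "dts", and the date format are illustrative assumptions, and a Spark 2.x session named spark is assumed):

    // Illustrative only: the datetime column ("dts") arrives as a plain string.
    val df = spark.read
      .option("header", "true")
      .csv("events.csv")   // "dts" is read as StringType, e.g. "05/26/2016 01:01:01"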

7 Answers
  •  甜味超标 2020-11-27 17:04

    Spark >= 2.2

    Since Spark 2.2 you can provide the format string directly:

    import org.apache.spark.sql.functions.to_timestamp
    
    val ts = to_timestamp($"dts", "MM/dd/yyyy HH:mm:ss")
    
    df.withColumn("ts", ts).show(2, false)
    
    // +---+-------------------+-------------------+
    // |id |dts                |ts                 |
    // +---+-------------------+-------------------+
    // |1  |05/26/2016 01:01:01|2016-05-26 01:01:01|
    // |2  |#$@#@#             |null               |
    // +---+-------------------+-------------------+
    

    Spark >= 1.6, < 2.2

    You can use the date processing functions introduced in Spark 1.5. Assuming you have the following data:

    val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$@#@#")).toDF("id", "dts")
    

    You can use unix_timestamp to parse the strings and cast the result to timestamp:

    import org.apache.spark.sql.functions.unix_timestamp
    
    val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
    
    df.withColumn("ts", ts).show(2, false)
    
    // +---+-------------------+---------------------+
    // |id |dts                |ts                   |
    // +---+-------------------+---------------------+
    // |1  |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
    // |2  |#$@#@#             |null                 |
    // +---+-------------------+---------------------+
    

    As you can see, this covers both parsing and error handling. The format string should be compatible with Java SimpleDateFormat.
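
    For instance, a different SimpleDateFormat pattern can be supplied in the same way; a hypothetical ISO-like input could be parsed with:

    // Hypothetical pattern, only to illustrate the SimpleDateFormat compatibility.
    unix_timestamp($"dts", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp")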

    Spark >= 1.5, < 1.6

    You'll have to use something like this:

    unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")
    

    or

    (unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss") * 1000).cast("timestamp")
    

    due to SPARK-11724.
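
    For example, applying the second variant (a sketch reusing the df and the unix_timestamp import from the previous section; ts15 is just an arbitrary name):

    // SPARK-11724 workaround: scale seconds to milliseconds before casting.
    val ts15 = (unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss") * 1000).cast("timestamp")

    df.withColumn("ts", ts15).show(2, false)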

    Spark < 1.5

    You should be able to use these with expr and HiveContext, for example:
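
    A minimal sketch of that approach (assumptions: a spark-shell style sc, and selectExpr standing in for expr so the same SQL expressions can be parsed; data and format follow the examples above):

    import org.apache.spark.sql.hive.HiveContext

    // HiveContext makes the Hive unix_timestamp / from_unixtime UDFs available in SQL expressions.
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$@#@#")).toDF("id", "dts")

    df.selectExpr(
      "id",
      "dts",
      "cast(from_unixtime(unix_timestamp(dts, 'MM/dd/yyyy HH:mm:ss')) as timestamp) AS ts"
    ).show()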
