Change the timestamp to UTC format in Pyspark

逝去的感伤 2020-12-16 08:14

I have an input DataFrame (ip_df); its data looks like this:

id            timestamp_value
1       2017-08-01T14:30:00+05:3         


        
2 answers
  •  心在旅途
    2020-12-16 08:44

    If you absolutely need the timestamp to be formatted exactly as indicated, namely, with the timezone represented as "+00:00", I think using a UDF as already suggested is your best option.
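
    For illustration only, here is a minimal sketch of the plain Python function such a UDF might wrap. The names to_utc_iso and to_utc_udf are hypothetical and not taken from the other answer, and the sketch assumes every input string carries an explicit offset such as "+05:30":

```python
from datetime import datetime, timezone


def to_utc_iso(ts):
    """Convert an ISO 8601 string with an offset to UTC, rendered with "+00:00".

    Hypothetical helper. Assumes `ts` includes an explicit offset; a string
    without one would be interpreted in the machine's local timezone.
    """
    if ts is None:
        # Pass nulls through, as Spark hands them to a Python UDF as None
        return None
    parsed = datetime.fromisoformat(ts)  # available since Python 3.7
    return parsed.astimezone(timezone.utc).isoformat()


# Wrapping it for Spark (not executed here, requires pyspark):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   to_utc_udf = udf(to_utc_iso, StringType())
#   op_df = ip_df.withColumn('timestamp_value', to_utc_udf('timestamp_value'))

print(to_utc_iso('2017-08-01T14:30:00+05:30'))
```

    Because the function runs row by row in Python, it is likely to be slower than the built-in column functions on large datasets.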

    However, if you can tolerate a slightly different representation of the timezone, e.g. either "+0000" (no colon separator) or "Z", it's possible to do this without a UDF, which may perform significantly better for you depending on the size of your dataset.

    Given the following representation of data

    +---+-------------------------+
    |id |timestamp_value          |
    +---+-------------------------+
    |1  |2017-08-01T14:30:00+05:30|
    |2  |2017-08-01T14:30:00+06:30|
    |3  |2017-08-01T14:30:00+07:30|
    +---+-------------------------+
    

    as given by:

    l = [(1, '2017-08-01T14:30:00+05:30'), (2, '2017-08-01T14:30:00+06:30'), (3, '2017-08-01T14:30:00+07:30')]
    ip_df = spark.createDataFrame(l, ['id', 'timestamp_value'])
    

    where timestamp_value is a String, you could do the following (this uses to_timestamp and session-local timezone support, both of which were introduced in Spark 2.2):

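    from pyspark.sql.functions import to_timestamp, date_format

    # Render timestamps in UTC for this session
    spark.conf.set('spark.sql.session.timeZone', 'UTC')
    op_df = ip_df.select(
        date_format(
            # Parse the string, including its "+05:30"-style offset (XXX)
            to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
            # Z formats the offset without a colon, e.g. "+0000"
            "yyyy-MM-dd'T'HH:mm:ssZ"
        ).alias('timestamp_value'))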
    

    which yields:

    +------------------------+
    |timestamp_value         |
    +------------------------+
    |2017-08-01T09:00:00+0000|
    |2017-08-01T08:00:00+0000|
    |2017-08-01T07:00:00+0000|
    +------------------------+
    

    or, slightly differently:

    op_df = ip_df.select(
        date_format(
            to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
            # XXX formats a zero offset as "Z"
            "yyyy-MM-dd'T'HH:mm:ssXXX"
        ).alias('timestamp_value'))
    

    which yields:

    +--------------------+
    |timestamp_value     |
    +--------------------+
    |2017-08-01T09:00:00Z|
    |2017-08-01T08:00:00Z|
    |2017-08-01T07:00:00Z|
    +--------------------+
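
    The hour arithmetic behind both result tables can be cross-checked outside Spark. The following is a rough sketch using only the Python standard library; utc_renderings is a hypothetical helper name, and the three strings it returns correspond to the "+0000", "Z", and "+00:00" representations discussed above:

```python
from datetime import datetime, timezone

# Same sample values as in the answer's input DataFrame
rows = [
    '2017-08-01T14:30:00+05:30',
    '2017-08-01T14:30:00+06:30',
    '2017-08-01T14:30:00+07:30',
]


def utc_renderings(ts):
    """Return the UTC instant of `ts` in three offset notations.

    Hypothetical helper for cross-checking; assumes `ts` has an explicit offset.
    """
    utc = datetime.fromisoformat(ts).astimezone(timezone.utc)
    no_colon = utc.strftime("%Y-%m-%dT%H:%M:%S%z")      # offset as +0000
    zulu = utc.strftime("%Y-%m-%dT%H:%M:%S") + "Z"      # offset as Z
    with_colon = utc.isoformat()                        # offset as +00:00
    return no_colon, zulu, with_colon


for ts in rows:
    print(utc_renderings(ts))
```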
    
