Change the timestamp to UTC format in Pyspark

Asked 2020-12-16 08:14

I have an input DataFrame (ip_df); the data in this DataFrame looks like this:

+---+-------------------------+
|id |timestamp_value          |
+---+-------------------------+
|1  |2017-08-01T14:30:00+05:30|
+---+-------------------------+
2 Answers
  • 2020-12-16 08:43

    You can use parser and tz from the dateutil library.
    I assume you have strings and want a string column:

    from dateutil import parser, tz
    from pyspark.sql.types import StringType
    from pyspark.sql.functions import col, udf

    # Create the UTC timezone
    utc_zone = tz.gettz('UTC')

    # Create a UDF to apply to the column:
    # it parses the string to a timestamp, converts it to UTC, then formats it back to a string
    func = udf(lambda x: parser.parse(x).astimezone(utc_zone).isoformat(), StringType())

    # Add the new column to your DataFrame
    df = df.withColumn("new_timestamp", func(col("timestamp_value")))
    

    This gives the following result:

    +---+-------------------------+-------------------------+
    |id |timestamp_value          |new_timestamp            |
    +---+-------------------------+-------------------------+
    |1  |2017-08-01T14:30:00+05:30|2017-08-01T09:00:00+00:00|
    |2  |2017-08-01T14:30:00+06:30|2017-08-01T08:00:00+00:00|
    |3  |2017-08-01T14:30:00+07:30|2017-08-01T07:00:00+00:00|
    +---+-------------------------+-------------------------+
    

    Finally, you can drop the original column and rename the new one:

    df = df.drop("timestamp_value").withColumnRenamed("new_timestamp","timestamp_value")
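    As a sanity check, the same parse-and-convert logic can be exercised in plain Python without Spark. This sketch uses only the standard library's datetime (whose fromisoformat handles this offset format since Python 3.7) instead of dateutil; the helper name to_utc_iso is mine, not from the answer:

```python
from datetime import datetime, timezone

def to_utc_iso(ts: str) -> str:
    """Parse an ISO-8601 string with a UTC offset, convert to UTC, and return an ISO string."""
    return datetime.fromisoformat(ts).astimezone(timezone.utc).isoformat()

print(to_utc_iso('2017-08-01T14:30:00+05:30'))  # 2017-08-01T09:00:00+00:00
```

    This is the same transformation the UDF performs on each row, so it is a quick way to verify expected outputs before running the Spark job.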
    
  • 2020-12-16 08:44

    If you absolutely need the timestamp to be formatted exactly as indicated, namely, with the timezone represented as "+00:00", I think using a UDF as already suggested is your best option.

    However, if you can tolerate a slightly different representation of the timezone, e.g. either "+0000" (no colon separator) or "Z", it's possible to do this without a UDF, which may perform significantly better for you depending on the size of your dataset.

    Given the following data:

    +---+-------------------------+
    |id |timestamp_value          |
    +---+-------------------------+
    |1  |2017-08-01T14:30:00+05:30|
    |2  |2017-08-01T14:30:00+06:30|
    |3  |2017-08-01T14:30:00+07:30|
    +---+-------------------------+
    

    as given by:

    l = [(1, '2017-08-01T14:30:00+05:30'), (2, '2017-08-01T14:30:00+06:30'), (3, '2017-08-01T14:30:00+07:30')]
    ip_df = spark.createDataFrame(l, ['id', 'timestamp_value'])
    

    where timestamp_value is a String, you could do the following (this uses to_timestamp and session local timezone support which were introduced in Spark 2.2):

    from pyspark.sql.functions import to_timestamp, date_format
    spark.conf.set('spark.sql.session.timeZone', 'UTC')
    op_df = ip_df.select(
        date_format(
            to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"), 
            "yyyy-MM-dd'T'HH:mm:ssZ"
        ).alias('timestamp_value'))
    

    which yields:

    +------------------------+
    |timestamp_value         |
    +------------------------+
    |2017-08-01T09:00:00+0000|
    |2017-08-01T08:00:00+0000|
    |2017-08-01T07:00:00+0000|
    +------------------------+
    

    or, slightly differently:

    op_df = ip_df.select(
        date_format(
            to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"), 
            "yyyy-MM-dd'T'HH:mm:ssXXX"
        ).alias('timestamp_value'))
    

    which yields:

    +--------------------+
    |timestamp_value     |
    +--------------------+
    |2017-08-01T09:00:00Z|
    |2017-08-01T08:00:00Z|
    |2017-08-01T07:00:00Z|
    +--------------------+
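    The difference between the "+0000" and "Z" outputs above comes from the Z vs. XXX pattern letters in Spark's Java-style format strings. For comparison, plain Python's strftime produces the same colon-less offset form via %z; this standalone sketch (my own illustration, not from the answer) shows it:

```python
from datetime import datetime, timezone

dt = datetime(2017, 8, 1, 9, 0, 0, tzinfo=timezone.utc)

# %z renders the UTC offset without a colon, matching Spark's "Z" pattern output
print(dt.strftime('%Y-%m-%dT%H:%M:%S%z'))  # 2017-08-01T09:00:00+0000
```

    If you need the "+00:00" form with a colon, that is exactly the case where the UDF approach from the first answer (or isoformat() in plain Python) is the simpler route.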
    