Change the timestamp to UTC format in PySpark

逝去的感伤 2020-12-16 08:14

I have an input dataframe (ip_df). The data in this dataframe looks like this:

id            timestamp_value
1       2017-08-01T14:30:00+05:3         


        
2 Answers
  •  忘掉有多难
    2020-12-16 08:43

    You can use parser and tz from the dateutil library.
    I assume the column holds strings and that you want a string column back:

    from dateutil import parser, tz
    from pyspark.sql.types import StringType
    from pyspark.sql.functions import col, udf
    
    # Create UTC timezone
    utc_zone =  tz.gettz('UTC')
    
    # Create a UDF to apply to the column
    # It takes the string, parses it to a datetime, converts it to UTC, then formats it as a string again
    func = udf(lambda x: parser.parse(x).astimezone(utc_zone).isoformat(),  StringType())
    
    # Create new column in your dataset
    df = df.withColumn("new_timestamp",func(col("timestamp_value")))
    

    This gives the following result:

    +---+-------------------------+-------------------------+
    |id |timestamp_value          |new_timestamp            |
    +---+-------------------------+-------------------------+
    |1  |2017-08-01T14:30:00+05:30|2017-08-01T09:00:00+00:00|
    |2  |2017-08-01T14:30:00+06:30|2017-08-01T08:00:00+00:00|
    |3  |2017-08-01T14:30:00+07:30|2017-08-01T07:00:00+00:00|
    +---+-------------------------+-------------------------+
    

    Finally, you can drop the original column and rename the new one:

    df = df.drop("timestamp_value").withColumnRenamed("new_timestamp","timestamp_value")
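
    The conversion done inside the UDF can be checked outside Spark first. The following is only a minimal sketch, not part of the original answer: the function name to_utc_iso is hypothetical, it uses the standard library's datetime.fromisoformat instead of dateutil's parser (an assumption that the strings are well-formed ISO 8601 with a colon-separated offset, Python 3.7+), and it passes None through because Spark hands SQL NULL to a Python UDF as None.

```python
from datetime import datetime, timezone
from typing import Optional


def to_utc_iso(value: Optional[str]) -> Optional[str]:
    """Convert an ISO 8601 string carrying a UTC offset to an ISO 8601 string in UTC."""
    if value is None:
        # Pass missing values through instead of raising inside the UDF.
        return None
    # fromisoformat accepts offsets such as +05:30 (Python 3.7+).
    parsed = datetime.fromisoformat(value)
    # astimezone shifts the wall-clock time to UTC; isoformat renders +00:00.
    return parsed.astimezone(timezone.utc).isoformat()


if __name__ == "__main__":
    for raw in ("2017-08-01T14:30:00+05:30", "2017-08-01T14:30:00+06:30"):
        print(raw, "->", to_utc_iso(raw))
```

    If this behaves as expected on sample values, it could be wrapped with udf(to_utc_iso, StringType()) in place of the lambda. Note that a string without any offset would be interpreted in the machine's local time zone by astimezone, so this sketch assumes every value carries an explicit offset.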
    
