Get date from two different timestamp formats in one pyspark dataframe [duplicate]

混江龙づ霸主 submitted on 2020-08-19 11:12:28

Question


I have a PySpark DataFrame with a timestamp field, but the field contains two different timestamp formats (both stored as strings).

+------------------------+
|timestamp               |
+------------------------+
|06-06-2019,17:15:46     |
|2020-01-01T06:07:22.000Z|
+------------------------+

How can I create another "date" column in the same PySpark DataFrame that captures only the date from the timestamp field? The ideal result looks like this:

+----------+------------------------+
|date      |timestamp               |
+----------+------------------------+
|2019-06-06|06-06-2019,17:15:46     |
|2020-01-01|2020-01-01T06:07:22.000Z|
+----------+------------------------+

Answer 1:


I think we need to define a function for this case and apply it to the DataFrame column. The function tries each format in turn and keeps the first one that parses.

Example:

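from pyspark.sql.functions import coalesce, col, to_date

# to_date yields null when a format does not match, so coalesce
# returns the first successful parse. The parameter is named
# `column` to avoid shadowing the imported `col` function.
def dynamic_date(column, frmts=("MM-dd-yyyy", "yyyy-MM-dd")):
    return coalesce(*[to_date(column, fmt) for fmt in frmts])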

df.show(10,False)
#+------------------------+
#|timestamp               |
#+------------------------+
#|06-06-2019,17:15:46     |
#|2020-01-01T06:07:22.000Z|
#+------------------------+

df.withColumn("dd",dynamic_date(col("timestamp"))).show(10,False)
#+------------------------+----------+
#|timestamp               |dd        |
#+------------------------+----------+
#|06-06-2019,17:15:46     |2019-06-06|
#|2020-01-01T06:07:22.000Z|2020-01-01|
#+------------------------+----------+
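
The output above depends on to_date accepting a pattern that covers only the leading date part of each string. Depending on the Spark version and the spark.sql.legacy.timeParserPolicy setting, parsing may be stricter, so it can help to write out patterns for the whole string. As a minimal sketch of the same "first format that parses wins" logic that dynamic_date expresses with coalesce, here is a plain-Python version that runs without a Spark session. The name first_matching_date and the two strptime patterns are assumptions for illustration, not part of the original answer, and strptime pattern syntax differs from Spark's.

```python
from datetime import date, datetime
from typing import Optional, Sequence

# Assumed full-string patterns for the two sample values
# (Python strptime syntax, not Spark datetime pattern syntax).
FORMATS = ("%m-%d-%Y,%H:%M:%S", "%Y-%m-%dT%H:%M:%S.%fZ")


def first_matching_date(
    value: Optional[str], formats: Sequence[str] = FORMATS
) -> Optional[date]:
    """Return the date from the first format that parses `value`, else None."""
    if value is None:
        return None
    for fmt in formats:
        try:
            # strptime raises ValueError when the string does not match fmt.
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue  # fall through to the next candidate format
    return None


samples = ["06-06-2019,17:15:46", "2020-01-01T06:07:22.000Z", "not a date"]
parsed = [first_matching_date(s) for s in samples]
```

A function like this can be used to check a list of candidate formats against sample values before translating them into Spark patterns. Wrapping it in a UDF is possible but is usually slower than the built-in coalesce and to_date combination.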


Source: https://stackoverflow.com/questions/63381249/get-date-from-two-different-timestamp-formats-in-one-pyspark-dataframe
