Question
I have a PySpark dataframe with a timestamp field, but the field contains two different timestamp formats (both stored as strings).
+------------------------+
|timestamp               |
+------------------------+
|06-06-2019,17:15:46     |
|2020-01-01T06:07:22.000Z|
+------------------------+
How can I create another "date" column in the same PySpark dataframe that captures only the date from the timestamp field? The ideal result looks like this:
+----------+------------------------+
|date      |timestamp               |
+----------+------------------------+
|2019-06-06|06-06-2019,17:15:46     |
|2020-01-01|2020-01-01T06:07:22.000Z|
+----------+------------------------+
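Answer 1:
I think we need to define a helper function for this case: it tries each date format with to_date and uses coalesce to keep the first result that is not null. We can then apply that function to the dataframe column.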
Example:
from pyspark.sql.functions import coalesce, col, to_date

def dynamic_date(column, frmts=("MM-dd-yyyy", "yyyy-MM-dd")):
    # Try each format in order; coalesce keeps the first non-null result.
    return coalesce(*[to_date(column, fmt) for fmt in frmts])
df.show(10,False)
#+------------------------+
#|timestamp               |
#+------------------------+
#|06-06-2019,17:15:46     |
#|2020-01-01T06:07:22.000Z|
#+------------------------+
df.withColumn("dd",dynamic_date(col("timestamp"))).show(10,False)
#+------------------------+----------+
#|timestamp               |dd        |
#+------------------------+----------+
#|06-06-2019,17:15:46     |2019-06-06|
#|2020-01-01T06:07:22.000Z|2020-01-01|
#+------------------------+----------+
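The fallback logic used above (try each format in turn, keep the first one that parses) can be checked locally without a Spark session. The following is only a rough sketch in plain Python, not part of the original answer: the function name first_matching_date and the strptime patterns are assumptions made for illustration, and the month-first reading of 06-06-2019 simply follows the answer's "MM-dd-yyyy".

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical full-string patterns for the two formats in the question.
# "%m-%d-%Y" assumes month-first, mirroring the answer's "MM-dd-yyyy".
FORMATS = ("%m-%d-%Y,%H:%M:%S", "%Y-%m-%dT%H:%M:%S.%fZ")

def first_matching_date(value: str, formats=FORMATS) -> Optional[date]:
    """Return the date from the first format that parses, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            # This format did not match; fall through to the next one.
            continue
    return None

rows = ["06-06-2019,17:15:46", "2020-01-01T06:07:22.000Z", "not a timestamp"]
results = [first_matching_date(r) for r in rows]
```

A value that matches none of the formats yields None, which plays the same role as the null that coalesce returns when every to_date call fails.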
Source: https://stackoverflow.com/questions/63381249/get-date-from-two-different-timestamp-formats-in-one-pyspark-dataframe