How do I truncate a PySpark dataframe of timestamp type to the day?

Posted by 和自甴很熟 on 2019-12-02 12:00:22

Question


I have a PySpark dataframe that includes timestamps in a column (call the column 'dt'), like this:

2018-04-07 16:46:00
2018-03-06 22:18:00

When I execute:

SELECT trunc(dt, 'day') as day

...I expected:

2018-04-07 00:00:00
2018-03-06 00:00:00

But I got:

null
null

How do I truncate to the day instead of the hour?


Answer 1:


You are using the wrong function. trunc supports only a few formats, and 'day' is not one of them, which is why you get null:

Returns date truncated to the unit specified by the format.

:param format: 'year', 'yyyy', 'yy' or 'month', 'mon', 'mm'

Use date_trunc instead:

Returns timestamp truncated to the unit specified by the format.

:param format: 'year', 'yyyy', 'yy', 'month', 'mon', 'mm', 'day', 'dd', 'hour', 'minute', 'second', 'week', 'quarter'

Example:

from pyspark.sql.functions import col, date_trunc

df = spark.createDataFrame(["2018-04-07 23:33:21"], "string").toDF("dt").select(col("dt").cast("timestamp"))

df.select(date_trunc("day", "dt")).show()
# +-------------------+
# |date_trunc(day, dt)|
# +-------------------+
# |2018-04-07 00:00:00|
# +-------------------+



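Answer 2:


One simple way to do it is with string manipulation. This assumes the column is named 'date', and the result is a string column rather than a timestamp:

from pyspark.sql.functions import lit, concat

df = df.withColumn('date', concat(df.date.substr(0, 10), lit(' 00:00:00')))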


Source: https://stackoverflow.com/questions/49947962/how-do-i-truncate-a-pyspark-dataframe-of-timestamp-type-to-the-day