PySpark dataframe convert unusual string format to Timestamp

Anonymous (unverified), submitted 2019-12-03 08:39:56

Question:

I am using PySpark through Spark 1.5.0. I have an unusual String format in rows of a column for datetime values. It looks like this:

Row[(daytetime='2016_08_21 11_31_08')] 

Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss format into a Timestamp? Ideally something along the lines of

df = df.withColumn("date_time",df.daytetime.astype('Timestamp')) 

I had thought that Spark SQL functions like regexp_replace could work, but of course I would need to replace _ with - in the date half and _ with : in the time part. I was thinking I could split the column in two using substring, counting back from the end of the string, then apply regexp_replace to each half separately and concatenate. But that seems like a lot of operations. Is there an easier way? (A sketch of this route follows below.)
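For reference, the split-and-replace route described above fits in a few expressions on Spark 1.5.0. This is a minimal sketch, assuming the daytetime column from the question and the fixed-width yyyy_MM_dd HH_mm_ss layout shown there:

from pyspark.sql.functions import concat, lit, regexp_replace

# Fixed-width input: characters 1-10 are the date, 12-19 the time.
date_part = regexp_replace(df.daytetime.substr(1, 10), "_", "-")
time_part = regexp_replace(df.daytetime.substr(12, 8), "_", ":")

# Reassemble "yyyy-MM-dd HH:mm:ss" and let the built-in
# string-to-timestamp cast parse it.
df = df.withColumn(
    "date_time",
    concat(date_part, lit(" "), time_part).cast("timestamp"))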

Answer 1:

Spark >= 2.2

from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    # HH = 24-hour clock; hh (12-hour) happens to work for this value
    # but would misread afternoon hours
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))

## +-------------------+-------------------+
## |dt                 |parsed             |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
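Applied to the question's dataframe this becomes a one-liner (a sketch, assuming the daytetime column name from the question); rows that do not match the pattern come back as null rather than raising an error:

from pyspark.sql.functions import to_timestamp

df = df.withColumn(
    "date_time", to_timestamp("daytetime", "yyyy_MM_dd HH_mm_ss"))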

Spark < 2.2

It is nothing that unix_timestamp cannot handle:

from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
        .cast("double")
        .cast("timestamp"))
    .show(1, False))

## +-------------------+---------------------+
## |dt                 |parsed               |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
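Note that unix_timestamp resolves to whole seconds since the epoch, so any fractional-second component in the source string would be lost on this route; the trailing .0 in the output is just how the resulting timestamp is displayed, not recovered precision.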

