I am using PySpark through Spark 1.5.0. One of my columns holds datetime values in an unusual string format. It looks like this:
Row(daytetime='2016_08_21 11_31_08')
Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss
format into a Timestamp? Ideally something along the lines of
df = df.withColumn("date_time",df.daytetime.astype('Timestamp'))
I had thought that Spark SQL functions like regexp_replace could work, but of course I would need to replace _ with - in the date half and _ with : in the time part. I was thinking I could split the column in two using substring, counting backward from the end of the string, run regexp_replace separately on each half, and then concatenate, roughly like the sketch below. But that seems like too many operations. Is there an easier way?
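Roughly, I mean something like this sketch (the intermediate column names are just placeholders I made up):

from pyspark.sql.functions import concat, lit, regexp_replace, substring

# Split into the date and time halves, swap the separators in each,
# then glue them back together and cast the result to a timestamp.
df = (df
    .withColumn("date_part", regexp_replace(substring("daytetime", 1, 10), "_", "-"))
    .withColumn("time_part", regexp_replace(substring("daytetime", 12, 8), "_", ":"))
    .withColumn("date_time",
                concat("date_part", lit(" "), "time_part").cast("timestamp")))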
Spark >= 2.2
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))

## +-------------------+-------------------+
## |dt                 |parsed             |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
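Applied to the DataFrame from the question (assuming the df and daytetime names used there), that becomes:

from pyspark.sql.functions import to_timestamp

# df / daytetime as in the question
df = df.withColumn("date_time", to_timestamp("daytetime", "yyyy_MM_dd HH_mm_ss"))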
Spark < 2.2
It is nothing that unix_timestamp cannot handle:
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
        .cast("double")
        .cast("timestamp"))
    .show(1, False))

## +-------------------+---------------------+
## |dt                 |parsed               |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
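On 1.5.0 the same idea gives you the column you were after (again assuming df and daytetime from the question):

from pyspark.sql.functions import unix_timestamp

# df / daytetime as in the question
df = df.withColumn(
    "date_time",
    unix_timestamp("daytetime", "yyyy_MM_dd HH_mm_ss").cast("double").cast("timestamp"))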