I am using PySpark through Spark 1.5.0. One of my columns holds datetime values in an unusual string format. It looks like this:
Row(daytetime='2016_08_21 11_31_08')
Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss
format into a Timestamp? Ideally something along the lines of
df = df.withColumn("date_time",df.daytetime.astype('Timestamp'))
I had thought that Spark SQL functions like regexp_replace could work, but of course I would need to replace _ with - in the date half and _ with : in the time part. I was thinking I could split the column in two using substring, counting backward from the end of the string, run regexp_replace separately on each half, and then concatenate, roughly like the sketch below. But that seems like too many operations. Is there an easier way?
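Roughly, I mean something like this sketch (the intermediate column names are just placeholders I made up):

from pyspark.sql.functions import concat, lit, regexp_replace, substring

# Split into the date and time halves, swap the separators in each,
# then glue them back together and cast the result to a timestamp.
df = (df
    .withColumn("date_part", regexp_replace(substring("daytetime", 1, 10), "_", "-"))
    .withColumn("time_part", regexp_replace(substring("daytetime", 12, 8), "_", ":"))
    .withColumn("date_time",
                concat("date_part", lit(" "), "time_part").cast("timestamp")))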
Spark >= 2.2
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))

## +-------------------+-------------------+
## |dt                 |parsed             |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
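Applied to the DataFrame from the question (assuming the df and daytetime names used there), that becomes:

from pyspark.sql.functions import to_timestamp

# df / daytetime as in the question
df = df.withColumn("date_time", to_timestamp("daytetime", "yyyy_MM_dd HH_mm_ss"))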
Spark < 2.2
It is nothing that unix_timestamp cannot handle:
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
        .cast("double")
        .cast("timestamp"))
    .show(1, False))

## +-------------------+---------------------+
## |dt                 |parsed               |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
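On 1.5.0 the same idea gives you the column you were after (again assuming df and daytetime from the question):

from pyspark.sql.functions import unix_timestamp

# df / daytetime as in the question
df = df.withColumn(
    "date_time",
    unix_timestamp("daytetime", "yyyy_MM_dd HH_mm_ss").cast("double").cast("timestamp"))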