Calculating duration by subtracting two datetime columns in string format

盖世英雄少女心 2020-12-04 15:40

I have a Spark DataFrame that consists of a series of dates:

from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import          


        
6 Answers
  •  时光取名叫无心
    2020-12-04 16:11

    As of Spark 1.5 you can use unix_timestamp to convert each string column to seconds since the epoch and then subtract the two results:

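    from pyspark.sql import functions as F

    # Pattern matching strings such as 2014-02-13T12:36:...
    timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
    # unix_timestamp parses each string into seconds since the epoch
    timeDiff = (F.unix_timestamp('EndDateTime', format=timeFmt)
                - F.unix_timestamp('StartDateTime', format=timeFmt))
    # Store the difference, in seconds, in a new column
    df = df.withColumn("Duration", timeDiff)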
    

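    Note that the format string uses Java-style date pattern letters, not Python's strftime directives. The resulting Duration column is in seconds.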

    >>> df.show()
    +---+--------------------+--------------------+--------+
    | ID|         EndDateTime|       StartDateTime|Duration|
    +---+--------------------+--------------------+--------+
    |X01|2014-02-13T12:36:...|2014-02-13T12:31:...|     258|
    |X02|2014-02-13T12:35:...|2014-02-13T12:32:...|     204|
    |X03|2014-02-13T12:36:...|2014-02-13T12:32:...|     228|
    |XO4|2014-02-13T12:37:...|2014-02-13T12:32:...|     269|
    |XO5|2014-02-13T12:36:...|2014-02-13T12:33:...|     202|
    +---+--------------------+--------------------+--------+
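
To make the arithmetic behind Duration easier to check, here is a minimal plain-Python sketch of the same idea, outside Spark. It is an analogy rather than Spark's implementation: the helper names, the sample timestamps, and the choice of UTC are all assumptions made for illustration (Spark uses the session time zone). It assumes that unix_timestamp yields whole seconds, so the millisecond part of each value is dropped before subtracting.

```python
from datetime import datetime, timezone

# strptime equivalent of the Java-style pattern "yyyy-MM-dd'T'HH:mm:ss.SSS"
PY_FMT = "%Y-%m-%dT%H:%M:%S.%f"


def to_epoch_seconds(text):
    """Parse a timestamp string into whole seconds since the epoch (UTC assumed)."""
    parsed = datetime.strptime(text, PY_FMT).replace(tzinfo=timezone.utc)
    # int() drops the fractional seconds, mirroring a whole-second result
    return int(parsed.timestamp())


def duration_seconds(start, end):
    """Difference between two timestamp strings, in whole seconds."""
    return to_epoch_seconds(end) - to_epoch_seconds(start)


# Hypothetical sample values, not taken from the DataFrame above
start = "2020-01-01T10:00:00.250"
end = "2020-01-01T10:04:18.900"
print(duration_seconds(start, end))
```

Because each side is truncated to whole seconds before the subtraction, the result can differ from the true elapsed time by up to one second. If sub-second precision matters, the fractional part has to be carried through separately.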
    
