Calculating duration by subtracting two datetime columns in string format

盖世英雄少女心 2020-12-04 15:40

I have a Spark DataFrame that consists of a series of dates:

from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import          


        
6 answers
  •  醉话见心
    2020-12-04 16:25

    Use DoubleType instead of IntegerType as the UDF's return type, because total_seconds() returns a float:

    from pyspark.sql import SQLContext, Row
    from pyspark.sql.types import StringType, DoubleType, StructType, StructField
    from pyspark.sql.functions import udf

    # assumes an existing SparkContext named sc
    sqlContext = SQLContext(sc)
    
    
    # Build sample data
    rdd = sc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876'),
                          ('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321'),
                          ('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229'),
                          ('XO4','2014-02-13T12:37:05.460','2014-02-13T12:32:36.881'),
                          ('XO5','2014-02-13T12:36:52.721','2014-02-13T12:33:30.323')])
    schema = StructType([StructField('ID', StringType(), True),
                         StructField('EndDateTime', StringType(), True),
                         StructField('StartDateTime', StringType(), True)])
    df = sqlContext.createDataFrame(rdd, schema)
    
    # define timedelta function (obtain duration in seconds)
    def time_delta(y,x): 
        from datetime import datetime
        end = datetime.strptime(y, '%Y-%m-%dT%H:%M:%S.%f')
        start = datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%f')
        delta = (end-start).total_seconds()
        return delta
    
    # register as a UDF 
    f = udf(time_delta, DoubleType())
    
    # Apply function
    df2 = df.withColumn('Duration', f(df.EndDateTime, df.StartDateTime))
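
The return type can be checked outside Spark. The following is a minimal sketch that runs the same parsing logic as time_delta on the first sample row (X01) in plain Python. The constant FMT is a name introduced here for illustration and is not part of the answer above.

```python
from datetime import datetime

# Timestamp layout used by the sample data (FMT is a hypothetical helper name)
FMT = '%Y-%m-%dT%H:%M:%S.%f'

def time_delta(y, x):
    """Return the seconds elapsed between start string x and end string y."""
    end = datetime.strptime(y, FMT)
    start = datetime.strptime(x, FMT)
    return (end - start).total_seconds()

# First sample row (X01): end time first, then start time
delta = time_delta('2014-02-13T12:36:14.899', '2014-02-13T12:31:56.876')
print(type(delta).__name__, delta)  # float 258.023
```

Because the result is a Python float with a fractional part, declaring the UDF with DoubleType matches what the function actually returns.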
    
