I have two timestamp columns in a DataFrame that I'd like to get the minute difference of, or alternatively, the hour difference of. Currently I'm only able to get the day difference.
You can get the difference in seconds with:
import org.apache.spark.sql.functions._
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
Then you can do some math to get the unit you want. For example:
val df2 = df1
  .withColumn("diff_secs", diff_secs_col)
  .withColumn("diff_mins", diff_secs_col / 60D)
  .withColumn("diff_hrs", diff_secs_col / 3600D)
  .withColumn("diff_days", diff_secs_col / (24D * 3600D))
Or, in PySpark:
from pyspark.sql.functions import col

diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
df2 = df1 \
    .withColumn("diff_secs", diff_secs_col) \
    .withColumn("diff_mins", diff_secs_col / 60.0) \
    .withColumn("diff_hrs", diff_secs_col / 3600.0) \
    .withColumn("diff_days", diff_secs_col / (24.0 * 3600.0))
The answer given by Daniel de Paula works; here is an alternative that computes the hour difference for each row with a single SQL expression:
val df2 = df1.selectExpr("(unix_timestamp(ts1) - unix_timestamp(ts2)) / 3600 AS diff_hrs")
This first converts each timestamp column to a Unix timestamp in seconds, subtracts the two, and then converts the difference to hours.
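The same unix_timestamp function also exists in org.apache.spark.sql.functions, so, as a sketch, the computation can be expressed through the DataFrame API instead of a SQL string (the diff_hrs name here is my own choice):
import org.apache.spark.sql.functions._

// Same computation via the function API, with an explicit column name.
val df2 = df1.select(
  ((unix_timestamp(col("ts1")) - unix_timestamp(col("ts2"))) / 3600D).alias("diff_hrs")
)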
A useful list of functions can be found at: http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.functions$