Spark SQL window over an interval between two specified time boundaries - between 3 hours and 2 hours ago

Submitted by 无人久伴 on 2019-12-04 19:11:36
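(For context, the range-based frame from the question would be expressed with Spark's Window API roughly as in the sketch below; df, id, ts, and value are assumed names for illustration, not from the original post.)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

// A range frame spanning "3 hours ago" to "2 hours ago", expressed in seconds
// over an epoch-seconds ordering column.
val w = Window
  .partitionBy(col("id"))             // assumed partitioning key
  .orderBy(col("ts").cast("long"))    // assumed timestamp column, cast to epoch seconds
  .rangeBetween(-3 * 3600, -2 * 3600)

val withAgg = df.withColumn("avg_3h_to_2h_ago", avg(col("value")).over(w))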

Since range intervals didn't work out, I had to fall back on an alternative approach. It goes something like this:

  • prepare a list of intervals for which computation needs to be performed
  • for each of the intervals, run the computation
    • each of those iterations produces a data frame
  • after the iterations, we have a list of data frames
  • union the data frames from the list into one bigger data frame
  • write out the results

In my case, I had to run the computation for each hour of the day and combine those "hourly" results, i.e. a list of 24 data frames, into one "daily" data frame.

The code, from a very high-level perspective, looks like this:

val hourlyDFs = for ((hourStart, hourEnd) <- hoursToStart.zip(hoursToEnd)) yield {
    // keep only the rows that fall in the current hourly interval
    val hourlyData = data.where($"hour" <= lit(hourEnd) && $"hour" >= lit(hourStart))
    // do stuff with hourlyData
    // return a data frame
}
hourlyDFs.reduce(_ union _)
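Fleshing that out: a minimal self-contained sketch of the whole pattern, assuming an input frame data with an hour column (0-23) and a placeholder aggregation; the column names, the computation, and the output path are made up for illustration.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, lit}

// Step 1: prepare the list of intervals (here, the 24 one-hour slices of a day).
val hoursToStart: Seq[Int] = 0 until 24
val hoursToEnd: Seq[Int] = hoursToStart // end == start for one-hour slices

// Steps 2-3: run the computation per interval; each iteration yields a data frame.
val hourlyDFs: Seq[DataFrame] =
  hoursToStart.zip(hoursToEnd).map { case (hourStart, hourEnd) =>
    data
      .where(col("hour") >= lit(hourStart) && col("hour") <= lit(hourEnd))
      .groupBy(col("hour"))
      .agg(avg(col("value")).as("avg_value")) // placeholder computation
  }

// Steps 4-5: union the 24 hourly results into one "daily" data frame.
// Note: reduce throws on an empty list, so guard it if the interval list can be empty.
val dailyDF = hourlyDFs.reduce(_ union _)

// Step 6: write out the results.
dailyDF.write.parquet("/tmp/daily_results") // illustrative path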