Question
I am joining a streaming dataset (LHS) with a static dataset (RHS). Since a row in the LHS can have multiple matches in the static dataset, the left_outer join explodes the data into duplicate rows per LHS id. I want to group these rows back together, collecting the RHS matches into a list.
Since it is guaranteed that there are no duplicates in the streaming data, I don't want to introduce a synthetic watermark column and aggregate over a time window on it. All the duplicates are introduced by my join alone, so I don't need to wait for any amount of time before collecting them. Is this possible with a Spark streaming join and aggregation?
val joinedData = lhsDataset.join(
  rhsDataset,
  lhsDataset("id") === rhsDataset("id"),
  "left_outer"
)

val aggregatedDataset = joinedData
  .groupBy(col("id"))
  .agg(collect_set(col("SubjectEnrolled")).alias("Subjects"))

aggregatedDataset.show()
Running this asks me to add watermarking and a window, which in turn requires adding synthetic event times. Can this be avoided, e.g. by telling Spark to group only within the current micro-batch of streaming data?
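One way to sidestep the streaming aggregation entirely (a sketch, not a confirmed answer, reusing the column names id and SubjectEnrolled from the snippet above): since rhsDataset is static, the collect_set can be performed on the static side before the join. The stream-static left_outer join then already yields one row per LHS id, so no watermark or window is needed:

```scala
import org.apache.spark.sql.functions.{col, collect_set}

// Pre-aggregate the static RHS once: one row per id, with all matching
// SubjectEnrolled values collected into a list. This is a plain batch
// aggregation on a static dataset, so no watermark is required.
val rhsGrouped = rhsDataset
  .groupBy(col("id"))
  .agg(collect_set(col("SubjectEnrolled")).alias("Subjects"))

// The stream-static left_outer join now produces exactly one row per
// streaming id; ids with no match get a null Subjects column.
val joinedOnce = lhsDataset.join(rhsGrouped, Seq("id"), "left_outer")
```

This moves the grouping out of the streaming plan altogether, which is why Spark no longer demands a watermark.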
Source: https://stackoverflow.com/questions/57705311/can-i-say-only-current-batch-by-watermarking-and-window-logic-for-aggregating-a