Question
I am joining a streaming dataset (LHS) with a static dataset (RHS). Since a row in the LHS can have multiple matches in the static dataset, the left_outer join explodes the data into duplicate rows per LHS id. I want to group these rows back together, collecting the RHS matches into a list.
Since it is guaranteed that there are no duplicates in the streaming data, I don't want to introduce a synthetic watermark column and aggregate over a time window on it. All the duplicates are introduced by my join alone, so I don't need to wait for any amount of time before collecting them. Is this possible with a Spark streaming join and aggregation?
val joinedData = lhsDataset.join(
  rhsDataset,
  lhsDataset("id") === rhsDataset("id"),
  "left_outer"
)

val aggregatedDataset = joinedData
  .groupBy(col("id"))
  .agg(collect_set(col("SubjectEnrolled")).alias("Subjects"))

aggregatedDataset.show()
Running this asks me to add watermarking and a window, which in turn requires adding synthetic event times. Can this be avoided, e.g. by telling Spark to group only within the current micro-batch of streaming data?
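One way to sidestep the streaming aggregation entirely (a sketch, not a confirmed answer, reusing the column names id and SubjectEnrolled from the snippet above): since rhsDataset is static, the collect_set can be performed on the static side before the join. The stream-static left_outer join then already yields one row per LHS id, so no watermark or window is needed:

```scala
import org.apache.spark.sql.functions.{col, collect_set}

// Pre-aggregate the static RHS once: one row per id, with all matching
// SubjectEnrolled values collected into a list. This is a plain batch
// aggregation on a static dataset, so no watermark is required.
val rhsGrouped = rhsDataset
  .groupBy(col("id"))
  .agg(collect_set(col("SubjectEnrolled")).alias("Subjects"))

// The stream-static left_outer join now produces exactly one row per
// streaming id; ids with no match get a null Subjects column.
val joinedOnce = lhsDataset.join(rhsGrouped, Seq("id"), "left_outer")
```

This moves the grouping out of the streaming plan altogether, which is why Spark no longer demands a watermark.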
Source: https://stackoverflow.com/questions/57705311/can-i-say-only-current-batch-by-watermarking-and-window-logic-for-aggregating-a