How to display intermediate results in a windowed streaming-etl?


Question


We currently do real-time aggregation of data in an event store. The idea is to visualize transaction data for multiple time ranges (monthly, weekly, daily, hourly) and for multiple nominal keys. We regularly have late data, so we need to account for that. Furthermore, the requirement is to display "running" results, that is, the value of the current window even before it is complete.

Currently we are using Kafka and Apache Storm (specifically Trident i.e. microbatches) to do this. Our architecture roughly looks like this:

(Apologies for my ugly pictures.) We use MongoDB as a key-value store to persist the state and then make it accessible (read-only) through a microservice that returns the current value it was queried for. There are multiple problems with that design:

  1. The code is really high maintenance
  2. It is really hard to guarantee exactly-once processing in this manner
  3. Updating the state after every aggregation obviously has performance implications, but it is sufficiently fast.

We got the impression that better frameworks, namely Apache Flink and Kafka Streams, have become available since we started this project (especially from a maintenance standpoint; Storm tends to be really verbose). Trying these out, it seemed that writing to a database, especially MongoDB, is no longer state of the art. The standard use case I saw is state being persisted internally in RocksDB or in memory and then written back to Kafka once a window is complete.

Unfortunately, this makes it quite difficult to display intermediate results, and because the state is persisted internally, we would need the allowed lateness of events to be on the order of months or years. Is there a better solution for these requirements than hijacking the state of the real-time stream? Personally, I feel like this is a standard requirement, but I couldn't find a standard solution for it.


Answer 1:


You could study Konstantin Knauf's Queryable Billing Demo as an example of how to approach some of the issues involved. The central, relevant ideas used there are:

  1. Trigger the windows after every event, so that their results are continuously updated (see the trigger sketch after this list)
  2. Make the results queryable (using Flink's queryable state API; a client-side sketch follows below)
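A minimal sketch of idea 1 in Flink (not taken from the demo itself): a custom trigger that fires after every element, so each incoming, possibly late, transaction immediately produces an updated result for its window, while the window's state is kept until the watermark plus allowed lateness has passed.

```java
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

// Fires the window on every element without purging its state, so late events
// keep refining the already-emitted running result.
public class EveryEventTrigger extends Trigger<Object, TimeWindow> {

    @Override
    public TriggerResult onElement(Object element, long timestamp,
                                   TimeWindow window, TriggerContext ctx) {
        // Keep an event-time timer so the window still fires a final result at its end.
        ctx.registerEventTimeTimer(window.maxTimestamp());
        return TriggerResult.FIRE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return time == window.maxTimestamp() ? TriggerResult.FIRE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
    }
}
```

Applied to a hypothetical keyed transaction stream (the key selector, window size, lateness, result type, and aggregate function are all placeholders):

```java
DataStream<Total> hourlyRunning = transactions
    .keyBy(t -> t.getAccountId())
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .allowedLateness(Time.days(7))          // late data still updates the emitted result
    .trigger(new EveryEventTrigger())
    .aggregate(new SumAggregate());         // running hourly totals, updated per event
```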

This was the subject of a Flink Forward conference talk. Video is available.
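For idea 2, a rough sketch of how the running results could be exposed with Flink's queryable state API; the state name, host, port, job ID, lookup key, and the AggregateResult type are placeholder assumptions, and both snippets are fragments rather than complete programs:

```java
// In the job: publish the latest running result per key under a queryable name.
hourlyRunning
    .keyBy(r -> r.getAccountId())
    .asQueryableState("running-aggregates");
```

```java
// In the serving microservice: read that state directly from the Flink cluster.
import java.util.concurrent.CompletableFuture;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.queryablestate.client.QueryableStateClient;

QueryableStateClient client = new QueryableStateClient("flink-taskmanager-host", 9069);
CompletableFuture<ValueState<AggregateResult>> future = client.getKvState(
        jobId,                               // JobID of the running streaming job
        "running-aggregates",
        "account-42",                        // key to look up
        BasicTypeInfo.STRING_TYPE_INFO,
        new ValueStateDescriptor<>("running-aggregates", AggregateResult.class));
future.thenAccept(state -> {
    try {
        System.out.println(state.value());   // latest running aggregate for the key
    } catch (Exception e) {
        e.printStackTrace();
    }
});
```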

Rather than making the results queryable, you could instead stream out the window updates to a dashboard or database.
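A sketch of that alternative, assuming the (pre-KafkaSink) FlinkKafkaProducer connector and a placeholder topic name; every trigger firing then becomes one update record for the dashboard or for a database-writing consumer:

```java
Properties kafkaProps = new Properties();
kafkaProps.setProperty("bootstrap.servers", "kafka:9092");

hourlyRunning
    .map(r -> r.getAccountId() + "," + r.getTotal())   // serialize however the dashboard expects
    .addSink(new FlinkKafkaProducer<>("running-aggregates",
                                      new SimpleStringSchema(),
                                      kafkaProps));
```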

Also, note that you can cascade windows, meaning that the results of the hourly windows could be the input to the daily windows, etc.
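A sketch of such a cascade (types and aggregate functions are placeholders). One caveat: if the hourly windows fire more than once per window (per-event trigger, late firings), the daily aggregation must replace that hour's contribution rather than add it again, so the simple sum below only works with windows that fire once on close:

```java
// Hourly totals per key, fired once when the hourly window closes...
DataStream<Total> hourly = transactions
    .keyBy(t -> t.getAccountId())
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .aggregate(new SumAggregate());

// ...are the input of the daily windows (their timestamps are the hourly window ends).
DataStream<Total> daily = hourly
    .keyBy(Total::getAccountId)
    .window(TumblingEventTimeWindows.of(Time.days(1)))
    .aggregate(new SumAggregate());
```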




Answer 2:


Kafka Streams offers "Interactive Queries". It is basically the same as Flink's "queryable state"; however, unlike in Flink, it is not marked as "beta".

In fact, for Kafka Streams there is work in progress to make "Interactive Queries" highly available by exploiting Kafka Streams' "standby tasks" (https://docs.confluent.io/current/streams/architecture.html#fault-tolerance).
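A minimal Kafka Streams sketch of the same pattern (topic name, store name, key, and serdes are placeholder assumptions): the windowed counts are materialized in a named store, and the running value of a still-open window can be read from that store via Interactive Queries.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class RunningCountsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "running-aggregates");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        // Count transactions per key in hourly windows; grace() tolerates late data.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("transactions", Consumed.with(Serdes.String(), Serdes.Long()))
               .groupByKey()
               .windowedBy(TimeWindows.of(Duration.ofHours(1)).grace(Duration.ofDays(7)))
               .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("hourly-counts"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Running (not yet final) window results are readable from the local store.
        // In practice, wait until the streams instance is RUNNING and the store is ready;
        // on Kafka 2.5+ the lookup is streams.store(StoreQueryParameters.fromNameAndType(...)).
        ReadOnlyWindowStore<String, Long> store =
                streams.store("hourly-counts", QueryableStoreTypes.windowStore());
        try (WindowStoreIterator<Long> iter = store.fetch(
                "account-42", Instant.now().minus(Duration.ofHours(1)), Instant.now())) {
            iter.forEachRemaining(kv ->
                    System.out.println(kv.key + " -> " + kv.value));  // window start -> count
        }
    }
}
```

In a real deployment the query side would sit behind the same kind of read-only microservice described in the question, with the application routing a request to whichever instance hosts the queried key.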

For more details, check out the following references:

  • Blog post: https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
  • Docs: https://docs.confluent.io/current/streams/developer-guide/interactive-queries.html
  • Examples: https://github.com/confluentinc/kafka-streams-examples/tree/5.4.0-post/src/main/java/io/confluent/examples/streams/interactivequeries
  • KIP-535 (will be included in the upcoming Apache Kafka 2.5 release): https://cwiki.apache.org/confluence/display/KAFKA/KIP-535%3A+Allow+state+stores+to+serve+stale+reads+during+rebalance


Source: https://stackoverflow.com/questions/59857255/how-to-display-intermediate-results-in-a-windowed-streaming-etl
