Structured Streaming Kafka Source Offset Storage


Question


I am using the Structured Streaming source for Kafka (Integration guide), which, as stated, does not commit any offsets.

One of my goals is to monitor it (check if it's lagging behind, etc.). Even though it does not commit the offsets, it handles them by querying Kafka from time to time and checking which one is next to process. According to the documentation the offsets are written to HDFS, so in case of failure they can be recovered, but the question is:

Where are they being stored? Is there any way of monitoring the Kafka consumption of a Spark Structured Streaming job from outside the program (a Kafka CLI or similar; the offset coming with each record does not suit the use case) if it does not commit the offsets?

Cheers


Answer 1:


Structured Streaming for Kafka saves its offsets to HDFS in the structure described below.

An example checkpointLocation setting:

.writeStream
  .....
  .option("checkpointLocation", "/tmp/checkPoint")
  .....
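
For context, here is a minimal end-to-end sketch showing where checkpointLocation fits (the broker, topic, and console sink are hypothetical placeholders); Spark maintains its offset log under this path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-offset-demo").getOrCreate()

// Read from Kafka; Spark tracks offsets itself instead of committing them to Kafka.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "your_broker1:port")
  .option("subscribe", "your_topic")
  .load()

// The offset log is written under /tmp/checkPoint/offsets, one file per batch.
val query = df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkPoint")
  .start()

query.awaitTermination()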

With this setting, Structured Streaming saves the offsets under the path below:

/tmp/checkPoint/offsets/$'batchid'

Each saved file has the following format:

v1
{"batchWatermarkMs":0,"batchTimestampMs":$'timestamp',"conf":{"spark.sql.shuffle.partitions":"200"}}
{"Topic1WithPartiton1":{"0":$'OffsetforTopic1ForPartition0'},"Topic2WithPartiton2":{"1":$'OffsetforTopic2ForPartition1',"0":$'OffsetforTopic2ForPartition1'}}

For example:

v1
{"batchWatermarkMs":0,"batchTimestampMs":1505718000115,"conf":{"spark.sql.shuffle.partitions":"200"}}
{"Topic1WithPartiton1":{"0":21482917},"Topic2WithPartiton2":{"1":103557997,"0":103547910}}

So, for monitoring offset lag, I think you need to develop a custom tool with the following functions:

  • Read the offsets from HDFS.
  • Commit those offsets to Kafka (the __consumer_offsets topic) under a consumer group of their own.

That way, existing offset-lag monitoring tools can monitor the Kafka offset lag of Structured Streaming, as sketched below.
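
A minimal sketch of such a tool, assuming a single Kafka source, the v1 checkpoint layout shown above, a hypothetical group.id, and json4s (bundled with Spark) for parsing; committing through the consumer API is what populates __consumer_offsets:

import java.util.Properties

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

import scala.collection.JavaConverters._

object CheckpointOffsetCommitter {
  implicit val formats: Formats = DefaultFormats

  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // Offsets files are named by batch id; pick the highest one.
    val latest = fs.listStatus(new Path("/tmp/checkPoint/offsets"))
      .map(_.getPath)
      .filter(_.getName.forall(_.isDigit))
      .maxBy(_.getName.toLong)

    // Per the example above: line 1 = version, line 2 = batch metadata,
    // line 3 = {"topic":{"partition":offset, ...}, ...}.
    val lines = scala.io.Source.fromInputStream(fs.open(latest)).getLines().toList
    val offsets = parse(lines(2)).extract[Map[String, Map[String, Long]]]

    val props = new Properties()
    props.put("bootstrap.servers", "your_broker1:port")   // hypothetical broker
    props.put("group.id", "structured-streaming-monitor") // hypothetical group
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    try {
      val toCommit = for {
        (topic, partitions) <- offsets
        (partition, offset) <- partitions
      } yield new TopicPartition(topic, partition.toInt) -> new OffsetAndMetadata(offset)
      consumer.commitSync(toCommit.asJava) // lag tools can now see this group
    } finally consumer.close()
  }
}

Run it periodically so the lag reported under that group tracks the stream; the group exists purely for monitoring and is independent of Spark's own processing.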




Answer 2:


Method 1: If you have configured a checkpointLocation (HDFS/S3, etc.), go to that path and you will find two directories, offsets and commits. offsets holds the current offsets, while commits holds the last committed ones. You can navigate to the commits directory and open the most recently modified file to find the last committed offsets, while the latest file in the offsets directory holds the info on consumed offsets.
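
A quick way to inspect these files (the paths and batch id are hypothetical):

hdfs dfs -ls /tmp/checkPoint/offsets      # one file per batch id
hdfs dfs -ls /tmp/checkPoint/commits      # batches that completed
hdfs dfs -cat /tmp/checkPoint/offsets/42  # consumed offsets for batch 42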

Method 2: You can also monitor the same with a StreamingQueryListener:

import org.apache.spark.sql.streaming.StreamingQueryListener

// AppLogging is assumed to be your own logging trait providing logDebug.
class CustomStreamingQueryListener extends StreamingQueryListener with AppLogging {

  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {
    logDebug(s"Started query with id : ${event.id}," +
      s" name: ${event.name},runId : ${event.runId}")
  }

  override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
    val progress = event.progress
    logDebug(s"Streaming query made progress: ${progress.prettyJson}")
  }

  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {
    logDebug(s"Stream exited due to exception : ${event.exception},id : ${event.id}, " +
      s"runId: ${event.runId}")
  }

}

and register it with your SparkSession's stream manager:

spark.streams.addListener(new CustomStreamingQueryListener())
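
Within onQueryProgress, the per-source offsets are also available directly on the event; a small sketch (a variant of the method above) using the SourceProgress fields, whose startOffset/endOffset strings are JSON in the same shape as the checkpoint files:

override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
  event.progress.sources.foreach { s =>
    // For the Kafka source, endOffset looks like {"your_topic":{"0":21482917}}.
    logDebug(s"source=${s.description} startOffset=${s.startOffset} endOffset=${s.endOffset}")
  }
}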



Answer 3:


A few things to note:

Monitoring: Ready-made monitoring can be found in the Streaming tab of the Spark UI. You can see which batch is currently being processed and how many are queued, to check the lag.

Check the max and min offsets for a topic: You can use the CLI for this, from a server where a Kafka broker is present (--time -2 returns the earliest available offsets; use --time -1 for the latest):

kafka-run-class kafka.tools.GetOffsetShell \
  --broker-list your_broker1:port,your_broker2:port,your_broker3:port \
  --topic your_topic \
  --time -2

More detailed information can be obtained if you integrate with Grafana.



Source: https://stackoverflow.com/questions/43662044/structured-streaming-kafka-source-offset-storage
