Kafka Connect FileStreamSource ignores appended lines

Submitted by 回眸只為那壹抹淺笑 on 2021-01-29 05:50:43

Question


I'm working on an application to process logs with Spark, and I thought I would use Kafka to stream the data from the log file. Basically, I have a single log file (on the local file system) that is continuously updated with new logs, and Kafka Connect seemed like the perfect solution to get the data already in the file along with the newly appended lines.

I'm starting the servers with their default configurations with the following commands:

Zookeeper server:

zookeeper-server-start.sh config/zookeeper.properties

zookeeper.properties

dataDir=/tmp/zookeeper
clientPort=2181
maxClientCnxns=0

Kafka server:

kafka-server-start.sh config/server.properties

server.properties

broker.id=0
log.dirs=/tmp/kafka-logs
zookeeper.connect=localhost:2181
[...]

Then I created the topic 'connect-test':

kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic connect-test

And finally I ran the Kafka connector:

connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties

connect-standalone.properties

bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true

internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false

offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000

connect-file-source.properties

name=my-file-connector
connector.class=FileStreamSource
tasks.max=1
file=/data/users/zamara/suivi_prod/app/data/logs.txt
topic=connect-test

At first I tested the connector by running a simple console consumer:

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning

Everything was working perfectly: the consumer received the logs from the file, and as I added logs, it kept picking up the new ones.

(Then I tried Spark as a "consumer" following this guide: https://spark.apache.org/docs/2.2.0/streaming-kafka-0-8-integration.html#approach-2-direct-approach-no-receivers and it was still fine)
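
For reference, the Spark side looked roughly like this (a sketch of that guide's direct approach, assuming the spark-streaming-kafka-0-8 artifact is on the classpath; the object name is illustrative):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LogStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("logs").setMaster("local[2]"), Seconds(10))
    // direct approach: no receiver, Spark tracks the Kafka offsets itself
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "localhost:9092"), Set("connect-test"))
    stream.map(_._2).print() // records are (key, value) pairs; print the payload
    ssc.start()
    ssc.awaitTermination()
  }
}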

After this, I removed some of the logs from the log file and changed the topic (I deleted the 'connect-test' topic, created another one, and edited connect-file-source.properties to use the new topic).
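
In other words, roughly this (a sketch; deleting a topic assumes delete.topic.enable=true on the broker):

kafka-topics.sh --delete --zookeeper localhost:2181 --topic connect-test
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic new-topic

# and in config/connect-file-source.properties:
#   topic=new-topic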

But now the connector doesn't work the same way anymore. When using the console consumer, I only get the logs that were already in the file and every new line I add is ignored. Maybe changing the topic (and/or modifying the data from the log file) without changing the connector name broke something in Kafka...

This is what Kafka Connect does with my topic 'new-topic' and connector 'new-file-connector':

[2018-05-16 15:06:42,454] INFO Created connector new-file-connector (org.apache.kafka.connect.cli.ConnectStandalone:104)
[2018-05-16 15:06:42,487] INFO Cluster ID: qjm74WJOSomos3pakXb2hA (org.apache.kafka.clients.Metadata:265)
[2018-05-16 15:06:42,522] INFO Updated PartitionLeaderEpoch. New: {epoch:0, offset:0}, Current: {epoch:-1, offset:-1} for Partition: new-topic-0. Cache now contains 0 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2018-05-16 15:06:52,453] INFO WorkerSourceTask{id=new-file-connector-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:328)
[2018-05-16 15:06:52,453] INFO WorkerSourceTask{id=new-file-connector-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:345)
[2018-05-16 15:06:52,458] INFO WorkerSourceTask{id=new-file-connector-0} Finished commitOffsets successfully in 5 ms (org.apache.kafka.connect.runtime.WorkerSourceTask:427)
[2018-05-16 15:07:02,459] INFO WorkerSourceTask{id=new-file-connector-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:328)
[2018-05-16 15:07:02,459] INFO WorkerSourceTask{id=new-file-connector-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:345)
[2018-05-16 15:07:12,459] INFO WorkerSourceTask{id=new-file-connector-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:328)
[2018-05-16 15:07:12,460] INFO WorkerSourceTask{id=new-file-connector-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:345)

(it keeps flushing 0 outstanding messages even after appending new lines to the file)

So I tried to start over: I deleted the /tmp/kafka-logs directory and the /tmp/connect.offsets file, and used a brand-new topic name, connector name and log file, just in case. But still, the connector ignores new logs... I even deleted my Kafka installation, re-extracted it from the archive and ran the whole process again (in case something had changed in Kafka), but with no success.
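
Concretely, the reset boiled down to this (with the broker, ZooKeeper and the Connect worker stopped; the paths come from the property files above):

rm -rf /tmp/kafka-logs        # broker data (log.dirs in server.properties)
rm -f /tmp/connect.offsets    # stored source offsets (offset.storage.file.filename)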

I'm confused as to where the problem is; any help would be appreciated!


Answer 1:


Per the docs:

The FileStream Connector examples are intended to show how a simple connector runs for those first getting started with Kafka Connect as either a user or developer. It is not recommended for production use.

I would use something like Filebeat (with its Kafka output) instead for ingesting logs into Kafka. Or kafka-connect-spooldir if your logs are not appended to directly but are standalone files placed in a folder for ingest.
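
For illustration, a minimal filebeat.yml could look roughly like this (a sketch, assuming Filebeat 7.x syntax with its Kafka output; the path and topic are taken from the question):

filebeat.inputs:
  - type: log
    paths:
      - /data/users/zamara/suivi_prod/app/data/logs.txt

output.kafka:
  hosts: ["localhost:9092"]
  topic: "connect-test"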




Answer 2:


Kafka Connect does not "watch" or "tail" a file, and I don't believe it is documented anywhere that it does.


I would say it is even less useful for reading active logs than using Spark Streaming to watch a folder: Spark will at least "recognize" newly created files, whereas the Kafka Connect FileStreamSource must be pointed at a single pre-existing, immutable file.

To get Spark to work with active logs, you would need something that performs "log rotation": when the active file reaches a maximum size or hits a condition such as the end of a time period (say, a day), the rotation process moves it into the directory Spark is watching and then starts a new log file for your application to continue writing to.
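
A hypothetical rotation step could be as simple as this (purely illustrative names and paths; tools like logrotate automate the size/time triggers and the signalling of the application):

# move the filled log into the directory Spark is watching; Spark picks it up
# as a new file, and the application then starts a fresh active log
mv /var/log/myapp/active.log /data/spark-watched/app-$(date +%Y%m%d%H%M%S).log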


If you want files to be actively watched and ingested into Kafka, then Filebeat, Fluentd, or Apache Flume can be used.



Source: https://stackoverflow.com/questions/50374784/kafka-connect-filestreamsource-ignores-appended-lines
