问题
I'm following here.While following the code. I came up with two Questions
- Is the Key and offset were the same?
According to Google,
Offset: A Kafka topic receives messages across a distributed set of partitions where they are stored. Each partition maintains the messages it has received in a sequential order where they are identified by an offset, also known as a position.
Seems both are very similar for me. Since offset maintain a unique message in the partition: Producers send records to a partition based on the record’s key
- What is the best way to choose the Key/Offset for a producer?
For an instance the example which I provided above, they have chosen the timestamp as the Key and offset. Is this the always the best recommendation?
class IRCMessageListener extends IRCEventAdapter {
@Override
public void onPrivmsg(String channel, IRCUser u, String msg) {
IRCMessage event = new IRCMessage(channel, u, msg);
//FIXME kafka round robin default partitioner seems to always publish to partition 0 only (?)
long ts = event.getInt64("timestamp");
Map<String, ?> srcOffset = Collections.singletonMap(TIMESTAMP_FIELD, ts);
Map<String, ?> srcPartition = Collections.singletonMap(CHANNEL_FIELD, channel);
SourceRecord record = new SourceRecord(srcPartition, srcOffset, topic, KEY_SCHEMA, ts, IRCMessage.SCHEMA, event);
queue.offer(record);
}
Because I'm actually trying to create a custom Kafka connector to get the data from 3rd Party WebSocket API. The API sends real-time data stream messages for a given Key value. So I thought of using that Key for my PartitionKey as well as Offset. But need to make sure I'm right about my thought.
回答1:
Key is an optional metadata, that can be sent with a Kafka message, and by default, it is used to route message to a specific partition. E.g. if you're sending a message m with key as k, to a topic mytopic that has p partitions, then m goes to the partition Hash(k) % p in mytopic. It has no connection to the offset of a partition whatsoever. Offsets are used by consumers to keep track of the position of last read message in a partition. In your case, if the timestamp is fairly randomly distributed, then it's fine, else you might be causing partition imbalance while using it as key.
回答2:
These are some basic differences :
Offset : maintained by kafka to keep a track of the records consumed to avoid loss of records and duplicate records while consuming.
Key : it is specific to input events,if it is not available then by default it is mentioned as null,this is useful while writing records to HDFS with default partition-er using kafka connect.every message can have a single key or many messages can have similar key.
来源:https://stackoverflow.com/questions/51245962/how-to-choose-a-key-and-offset-for-a-kafka-producer