What is Kafka?
A distributed streaming platform, similar to a message queue.
A streaming platform can: 1. publish and subscribe to streams of records; 2. store streams of records durably (useful for staging important data); 3. process streams of records.
Kafka is commonly used as a buffer for high-speed output, thanks to its outstanding performance. It can also stage important data, because records are not deleted immediately after being consumed (read), and the data can be replicated.
Kafka APIs
Producer: produces data to Kafka; a single producer can write to multiple topics.
Consumer: consumes data from Kafka; a single consumer can read from multiple topics.
Streams: for simple stream processing; it can both produce and consume, across multiple topics.
Connector: for creating and running multiple producers and consumers that connect topics to existing applications or data systems.
Multi-language: mainstream languages such as Java all have corresponding client APIs.
Topic: a category of records; every record is published to some topic.
Record: {key, value, timestamp}
Kafka can be thought of as an immutable queue (of records) that retains data for a configured period of history, e.g. two days.
The only per-consumer metadata kept is an offset, which the consumer controls freely, so the offset need not simply increase: if needed, a consumer can go back and read data from two days ago, or jump straight to the latest data (see the seek sketch after this list).
Partition: a topic is divided into multiple partitions, which can live on different machines; records are assigned to partitions (you can write a partition function based on the key; a sketch also follows this list), which reduces the load on any single node.
Replica: each partition has multiple replicas, which can live on different machines, made up of one leader and zero or more followers. Only the leader serves reads and writes; followers only replicate the leader's writes. If the leader dies, a new leader is elected from among the followers.
Guarantee: within each partition, records are stored in the order they were produced.
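To make that offset control concrete, here is a minimal sketch, assuming a local broker at localhost:9092 and the single-partition topic test from the quickstart below (SeekExample and the group id seek-demo are made-up names):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "seek-demo"); // made-up group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.IntegerDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<Integer, String> consumer = new KafkaConsumer<>(props);

        // assign() takes explicit partitions (no group rebalancing), so seek() can be called right away
        TopicPartition p0 = new TopicPartition("test", 0);
        consumer.assign(Collections.singletonList(p0));

        consumer.seekToBeginning(Collections.singletonList(p0)); // replay the retained history
        // consumer.seekToEnd(Collections.singletonList(p0));    // or skip to the newest data
        // consumer.seek(p0, 42L);                               // or jump to an arbitrary offset

        System.out.println("position after seek: " + consumer.position(p0)); // position() forces the lazy seek to resolve
        consumer.close();
    }
}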
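Likewise, a minimal sketch of a key-based partition function (KeyHashPartitioner is a made-up name; the hashing mirrors what Kafka's default partitioner does for non-null keys):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Routes records to partitions by hashing the key; records with no key go to partition 0
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) return 0;
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

To use it, set partitioner.class in the producer properties: props.put("partitioner.class", KeyHashPartitioner.class.getName());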
Other concepts:
Consumer group: the partitions of a topic are dynamically and evenly assigned to the consumers in a consumer group. A topic is broadcast to every consumer group (each group acts as one subscriber), while within a group the topic's records are split among the different consumers. If a group has more consumers than the topic has partitions, the surplus consumers receive no messages.
Broker: a machine/server in a Kafka cluster is called a broker; it is a physical concept.
ISR (in-sync replicas): the replicas that are alive and in sync. If a topic's partition has three replicas, on brokers 0, 1, and 2, then its ISR is {0, 1, 2}.
Shrink: if a broker dies, the ISR shrinks; when it comes back up, the ISR expands.
HW (high watermark): the smallest LEO (LogEndOffset) among the logs of the replicas in the partition's ISR. For example, if the three ISR replicas have LEOs of 10, 8, and 9, the HW is 8, and consumers can read only up to it.
Commit: after reading, the consumer updates its current offset for the partition; there are several ways to handle commits (a manual-commit sketch follows this list).
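As one of those ways, here is a minimal manual-commit sketch, again assuming a local broker at localhost:9092 and the quickstart topic test (ManualCommitExample and the group id manual-commit-demo are made-up names); unlike the Java demo further below, it disables auto commit and commits synchronously after each poll:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "manual-commit-demo"); // made-up group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // turn auto commit off
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.IntegerDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<Integer, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test"));
            while (true) {
                ConsumerRecords<Integer, String> records = consumer.poll(1000);
                for (ConsumerRecord<Integer, String> record : records) {
                    System.out.println("processed offset " + record.offset());
                }
                // Commit only after the batch is fully processed; blocks until the broker confirms
                consumer.commitSync();
                // Alternative: consumer.commitAsync(); // non-blocking, but failed commits are not retried
            }
        }
    }
}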
Kafka performance:
Producer throughput (single producer thread, no replication): 821,557 records/sec (78.3 MB/sec)
Consumer throughput (single consumer): 940,521 records/sec (89.7 MB/sec)
These figures are for a single producer and a single consumer, each on a single thread; with multiple threads, or multiple producers or consumers, the numbers are even more staggering.
For details, see:
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
Installation and a simple command-line producer and consumer
(from http://kafka.apache.org/quickstart)
Download and extract:
$tar -xzf kafka_2.11-1.1.0.tgz
$cd kafka_2.11-1.1.0
Start the single-node ZooKeeper bundled with Kafka
$bin/zookeeper-server-start.sh config/zookeeper.properties
Start Kafka
$bin/kafka-server-start.sh config/server.properties
Create a topic
$bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Describe the topic
$bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
Start a producer and send data (command line)
$bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>This is a message
>This is another message
Start a consumer and receive data (command line)
$bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test
Read historical data
Add --from-beginning to the previous command.
Check a partition's latest offset (the producer-side offset)
$bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic test
List the group ids of existing consumer groups
$bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
List existing topics
$bin/kafka-topics.sh --list --zookeeper localhost:2181
Kafka demo (Java)
producer
package com.cloudwiz.kafkatest.example;

import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class Producer extends Thread {
    private final int producerNo;
    private final String topic;
    private final KafkaProducer<Integer, String> producer;

    public Producer(String topic, int producerNo) {
        this.topic = topic;
        this.producerNo = producerNo;
        Properties props = new Properties();
        props.put("bootstrap.servers", KafkaProperties.KAFKA_SERVER_URL + ":" + KafkaProperties.KAFKA_SERVER_PORT);
        props.put("client.id", "Producer_" + producerNo);
        props.put("key.serializer", "org.apache.kafka.common.serialization.IntegerSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
    }

    @Override
    public void run() {
        int messageNo = 0;
        while (true) {
            String messagestr = "Message_" + messageNo + "_from_producer_" + producerNo;
            try {
                // Synchronous send: get() blocks until the broker acknowledges the record
                RecordMetadata recordMetadata = producer.send(new ProducerRecord<>(topic, messageNo, messagestr)).get();
                System.out.println("Sent message:" + messagestr + ", partition_" + recordMetadata.partition()
                        + ", offset " + recordMetadata.offset());
                // Asynchronous send: returns immediately, result handled in the callback
                // producer.send(new ProducerRecord<>(topic, messageNo, messagestr), new ProducerCallback());
                // System.out.println("Sent message:" + messagestr);
                // Fire and forget: send without checking the result
                // producer.send(new ProducerRecord<>(topic, messageNo, messagestr));
            } catch (InterruptedException | ExecutionException e) {
                e.printStackTrace();
            }
            ++messageNo;
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * @param args [0]: topic  [1]: number of producer threads to start
     */
    public static void main(String[] args) {
        int numProducer = Integer.parseInt(args[1]);
        for (int i = 0; i < numProducer; i++) {
            Producer producer = new Producer(args[0], i);
            producer.start();
        }
    }
}

class ProducerCallback implements Callback {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception e) {
        if (e != null) e.printStackTrace();
    }
}
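For example, passing the arguments test 2 to main starts two producer threads that each write a numbered message to topic test every 300 ms, printing the partition and offset reported by the broker.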
consumer
package com.cloudwiz.kafkatest.example;

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class Consumer extends Thread {
    private final String topic;
    private final KafkaConsumer<Integer, String> consumer;
    private int consumerNo;
    private String groupId;

    Consumer(String topic, String groupId, int consumerNo) {
        this.topic = topic;
        this.groupId = groupId;
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, KafkaProperties.KAFKA_SERVER_URL + ":" + KafkaProperties.KAFKA_SERVER_PORT);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, this.groupId);
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.IntegerDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        consumer = new KafkaConsumer<>(props);
        this.consumerNo = consumerNo;
    }

    @Override
    public void run() {
        // Subscribe once, before the poll loop; poll() fails if the consumer has no subscription
        consumer.subscribe(Collections.singletonList(topic));
        while (true) {
            ConsumerRecords<Integer, String> records = consumer.poll(1000);
            for (ConsumerRecord<Integer, String> record : records) {
                System.out.println("Consumer No." + this.consumerNo + " in group " + groupId + " received a record, "
                        + "key = " + record.key() + ", value = " + record.value()
                        + ", partition = " + record.partition() + ", offset = " + record.offset());
            }
        }
    }

    /**
     * @param args [0]: topic  [1]: number of consumer threads to start  [2]: consumer group id
     */
    public static void main(String[] args) {
        int numConsumer = Integer.parseInt(args[1]);
        for (int i = 0; i < numConsumer; i++) {
            Thread consumer = new Consumer(args[0], args[2], i);
            consumer.start();
        }
    }
}
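For example, the arguments test 2 group1 start two consumers in group group1. If topic test has two partitions, each consumer is assigned one; with only one partition (as created in the quickstart above), the second consumer receives nothing, illustrating the consumer-group rule described earlier.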
KafkaProperties
package com.cloudwiz.kafkatest.example;

public class KafkaProperties {
    public static final String KAFKA_SERVER_URL = "192.168.235.138";
    public static final String KAFKA_SERVER_PORT = "9092";
}
Creating new topics with the AdminClient API
You can specify the number of partitions and the number of replicas.
// If the topic already exists, do nothing (it will not be overwritten)
private void createNewTopicInKafka(String token) {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, ServerProperties.KAFKA_SERVER_IP + ":" + ServerProperties.KAFKA_SERVER_PORT);
    props.put(AdminClientConfig.CLIENT_ID_CONFIG, "admin");
    AdminClient admin = AdminClient.create(props);
    try {
        if (admin.listTopics().names().get().contains(token)) return;
        CreateTopicsResult res = admin.createTopics(Collections.singletonList(
                new NewTopic(token, ServerProperties.KAFKA_NUM_PARTITIONS, ServerProperties.KAFKA_NUM_RELICAS)));
        res.all().get();
    } catch (InterruptedException | ExecutionException e) {
        e.printStackTrace();
    } finally {
        admin.close();
    }
}
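Note the design choice here: res.all().get() blocks until the brokers have acknowledged creation of every requested topic (or an error is raised), so the method only returns once the topic actually exists, and closing the AdminClient in the finally block releases its connections either way.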