Bucket records based on time (kafka-hdfs-connector)

Submitted by 前提是你 on 2019-12-07 23:27:31

Question


I am trying to copy data from Kafka into Hive tables using the kafka-hdfs-connector provided by the Confluent platform. While I am able to do this successfully, I am wondering how to bucket the incoming data by time interval. For example, I would like to have a new partition created every 5 minutes.

I tried io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner with partition.duration.ms, but I think I am doing it the wrong way. I see only one partition in the Hive table, with all the data going into that particular partition. Something like this:

hive> show partitions test;
OK
partition
year=2016/month=03/day=15/hour=19/minute=03

All the Avro objects are getting copied into this partition.

Instead, I would like to have something like this:

hive> show partitions test;
OK
partition
year=2016/month=03/day=15/hour=19/minute=03
year=2016/month=03/day=15/hour=19/minute=08
year=2016/month=03/day=15/hour=19/minute=13

Initially, the connector would create the path year=2016/month=03/day=15/hour=19/minute=03 and copy all incoming data into that directory for the next 5 minutes; at the start of the 6th minute it should create a new path, i.e. year=2016/month=03/day=15/hour=19/minute=08, copy data into that directory for the next 5 minutes, and so on.
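To make the intent concrete, the bucketing I am after is just rounding each record's timestamp down to the nearest 5-minute boundary and rendering that as a path. A minimal sketch of that arithmetic (illustrative only, with hypothetical class and variable names; this is not the connector's actual code):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class BucketSketch {
    public static void main(String[] args) {
        long partitionDurationMs = 300000L;          // 5 minutes, like partition.duration.ms
        long timestamp = System.currentTimeMillis(); // a record's timestamp
        // Round down to the start of the enclosing 5-minute bucket.
        long bucketStart = timestamp - (timestamp % partitionDurationMs);

        // Note 'mm' (minute of hour), not 'MM' (month of year).
        SimpleDateFormat fmt = new SimpleDateFormat(
            "'year'=yyyy/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        System.out.println(fmt.format(new Date(bucketStart)));
        // prints e.g. year=2016/month=03/day=15/hour=19/minute=05
    }
}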

This is what my config file looks like:

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test
hdfs.url=hdfs://localhost:9000
flush.size=3
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
partition.duration.ms=300000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=MM/
locale=en
timezone=GMT
logs.dir=/kafka-connect/logs
topics.dir=/kafka-connect/topics
hive.integration=true
hive.metastore.uris=thrift://localhost:9083
schema.compatibility=BACKWARD

It would be really helpful if someone could point me in the right direction. I would be glad to share more details if required; I just don't want to make this question look like one that never ends.

Many thanks!


Answer 1:


Your minute field in path.format is wrong:

path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=MM/

It should be:

path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm/

In the date-format pattern, MM means month of year while mm means minute of hour, so your original format stamped the month (03) into the minute field. The path then only changed when the hour rolled over, which is why all the data landed in a single partition.
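For completeness, the partitioner section of your config would then read (same values as before, only the minute token changed):

partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
partition.duration.ms=300000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm/
locale=en
timezone=GMT

With partition.duration.ms=300000, each partition then covers a 5-minute window, giving the minute=03 / minute=08 / minute=13 layout you described.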


Source: https://stackoverflow.com/questions/36036507/bucket-records-based-on-timekafka-hdfs-connector
