Question
I am new to Flume (and to HDFS), so I hope my question is not stupid.
I have a multi-tenant application (about 100 different customers as of now) and 16 different data types.
(In production, we have approx. 15 million messages/day through our RabbitMQ)
I want to write all my events to HDFS, separated by tenant, data type, and date, like this:
/data/{tenant}/{data_type}/2014/10/15/file-08.csv
Is it possible with one sink definition? I don't want to duplicate the configuration, and new clients arrive every week or so.
In the documentation, I see:
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%Y/%m/%d/%H/
Is this possible?
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%tenant/%type/%Y/%m/%d/%H/
I want to write to different folders according to my incoming data.
Answer 1:
Yes, this is indeed possible. You can use either event metadata (headers) or some field in the incoming data to route the output.
For example, in my case I receive different types of log data and want to store each type in its own folder. The first word of each log line is the file name. Here is the relevant config snippet.
Interceptor:
dataplatform.sources.source1.interceptors = i3
dataplatform.sources.source1.interceptors.i3.type = regex_extractor
dataplatform.sources.source1.interceptors.i3.regex = ^(\\w*)\t.*
dataplatform.sources.source1.interceptors.i3.serializers = s1
dataplatform.sources.source1.interceptors.i3.serializers.s1.name = filename
HDFS Sink:
dataplatform.sinks.sink1.type = hdfs
dataplatform.sinks.sink1.hdfs.path = hdfs://server/events/provider=%{filename}/years=%Y/months=%Y%m/days=%Y%m%d/hours=%H
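Applied to your layout, a single sink definition can route by tenant and data type the same way, provided each event carries the corresponding headers. A minimal sketch, assuming headers named tenant and type are set upstream (by an interceptor or by your source; the agent and sink names follow your question):
agent1.sinks.hdfs-sink1.type = hdfs
# %{header} substitutes the value of an event header into the path
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/data/%{tenant}/%{type}/%Y/%m/%d/
# Timestamp escapes (%Y, %m, ...) need a timestamp header on the event,
# e.g. from a timestamp interceptor, or hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true
An event with headers tenant=acme and type=orders received on 2014-10-15 would then land under /data/acme/orders/2014/10/15/.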
Hope this helps.
Answer 2:
A possible solution is to write an interceptor that passes the tenant value as an event header.
Please refer to the link below:
http://hadoopi.wordpress.com/2014/06/11/flume-getting-started-with-interceptors/
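If the tenant and data type can be parsed from the message body, the built-in regex_extractor interceptor may already be enough, so no custom Java is needed. A minimal sketch, assuming each event body starts with the tenant and the type separated by tabs (the agent, source, and field layout are assumptions):
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = regex_extractor
# Capture the first two tab-delimited fields of the event body into headers
agent1.sources.source1.interceptors.i1.regex = ^(\\w+)\\t(\\w+)\\t.*
agent1.sources.source1.interceptors.i1.serializers = s1 s2
agent1.sources.source1.interceptors.i1.serializers.s1.name = tenant
agent1.sources.source1.interceptors.i1.serializers.s2.name = type
The sink path can then reference %{tenant} and %{type}, as shown in the first answer.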
Source: https://stackoverflow.com/questions/26385035/hdfs-sink-clever-folder-routing