flume-ng

Reading Flume spoolDir in parallel

Submitted by 爱⌒轻易说出口 on 2019-12-05 20:17:43
Since I'm not allowed to set up Flume on the prod servers, I have to download the logs, put them in a Flume spoolDir, and have a sink consume from the channel and write to Cassandra. Everything is working fine. However, as I have a lot of log files in the spoolDir and the current setup only processes one file at a time, it's taking a while. I want to be able to process many files concurrently. One way I thought of is to keep the spoolDir approach but distribute the files into 5-10 different directories and define multiple sources/channels/sinks, but this is a bit clumsy. Is there a better way to do this?
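If splitting the files across directories is acceptable, a somewhat less clumsy variant keeps a single agent and declares several spooling-directory sources that all feed the same channel, so only the source block is repeated while the channel and the Cassandra sink stay singular. A minimal sketch, assuming hypothetical directory names under /data/spool and leaving the custom Cassandra sink's own settings out:

    # Sketch only: agent/source/channel names and paths are placeholders.
    agent.sources = src1 src2 src3
    agent.channels = ch1
    agent.sinks = cassandra-sink

    agent.sources.src1.type = spooldir
    agent.sources.src1.spoolDir = /data/spool/1
    agent.sources.src1.channels = ch1

    agent.sources.src2.type = spooldir
    agent.sources.src2.spoolDir = /data/spool/2
    agent.sources.src2.channels = ch1

    agent.sources.src3.type = spooldir
    agent.sources.src3.spoolDir = /data/spool/3
    agent.sources.src3.channels = ch1

    agent.channels.ch1.type = memory

    # Custom Cassandra sink: type and connection settings omitted here.
    agent.sinks.cassandra-sink.channel = ch1

Each source then reads its own directory independently, so the parallelism is roughly the number of source directories; if the single sink becomes the bottleneck, additional sinks can drain the same channel.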

unable to download data from twitter through flume

Submitted by 混江龙づ霸主 on 2019-12-04 22:43:25
    bin/flume-ng agent -n TwitterAgent --conf ./conf/ -f conf/flume-twitter.conf -Dflume.root.logger=DEBUG,console

When I run the above command, it generates the following error:

    2016-05-06 13:33:31,357 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] 404:The URI requested is invalid or the resource requested, such as a user, does not exist. Unknown URL. See Twitter Streaming API documentation at http://dev.twitter.com/pages/streaming_api

This is my flume-twitter.conf file, located in the flume/conf folder (snippet truncated): TwitterAgent …
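For context, the source block in such a flume-twitter.conf usually looks something like the sketch below; this assumes the experimental org.apache.flume.source.twitter.TwitterSource bundled with Flume and uses placeholder credentials, so treat it as a rough reference rather than a fix for the 404 (which generally means the streaming URL hit by the Twitter client library no longer exists):

    # Sketch only: placeholder keys/tokens and an assumed channel name.
    TwitterAgent.sources = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = <consumer-key>
    TwitterAgent.sources.Twitter.consumerSecret = <consumer-secret>
    TwitterAgent.sources.Twitter.accessToken = <access-token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <access-token-secret>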

Flume - Can an entire file be considered an event in Flume?

Submitted by 安稳与你 on 2019-12-04 13:57:58
I have a use case where I need to ingest files from a directory into HDFS. As a POC, I used simple directory spooling in Flume, where I specified the source, sink and channel, and it works fine. The disadvantage is that I would have to maintain multiple directories for the multiple file types that go into distinct folders in order to get greater control over file sizes and other parameters, which makes the configuration repetitive, though easy. As an alternative, I was advised to use regex interceptors …
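On the "entire file as one event" part of the title: the spooling-directory source takes a pluggable deserializer, and Flume ships a BLOB deserializer that emits a whole file as a single event up to a configurable byte limit. A rough sketch, with hypothetical source/channel names and directory:

    # Sketch only: source/channel names and the directory are placeholders.
    agent.sources.spool-src.type = spooldir
    agent.sources.spool-src.spoolDir = /data/incoming
    agent.sources.spool-src.channels = mem-ch
    # One event per file, buffering up to maxBlobLength bytes in memory.
    agent.sources.spool-src.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
    agent.sources.spool-src.deserializer.maxBlobLength = 100000000

Since the whole file is held as one event, the channel capacity and the agent heap need to accommodate the largest expected file.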

Getting 'checking flume.conf for changes' in a loop

Submitted by ⅰ亾dé卋堺 on 2019-12-03 21:01:47
I am using Apache Flume 1.4.0 to collect log files (auth.log) and store them in HDFS (Hadoop 2.6.0). The command used is:

    bin/flume-ng agent --conf ./conf/ -f flume.conf -Dflume.root.logger=DEBUG,console -n agent

The flume.conf file contains the following:

    agent.channels.memory-channel.type = memory
    agent.sources.tail-source.type = exec
    agent.sources.tail-source.command = tail -F /var/log/auth.log
    agent.sources.tail-source.channels = memory-channel
    agent.sinks.log-sink.channel = memory-channel
    agent.sinks.log-sink.type = logger
    agent.sinks.hdfs-sink.channel = memory-channel
    agent.sinks.hdfs-sink …
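Not a definitive diagnosis, but one thing the truncated snippet does not show is the top-level component lists; if they are missing, the agent started with -n agent has no sources, channels or sinks to run, and the polling configuration provider just keeps logging that it is checking flume.conf for changes. A sketch of the same config with the lists added (the hdfs-sink lines past the truncation are assumed, with a placeholder path):

    # Sketch: the top-level lists bind the components to the agent named "agent".
    agent.sources = tail-source
    agent.channels = memory-channel
    agent.sinks = log-sink hdfs-sink

    agent.channels.memory-channel.type = memory

    agent.sources.tail-source.type = exec
    agent.sources.tail-source.command = tail -F /var/log/auth.log
    agent.sources.tail-source.channels = memory-channel

    agent.sinks.log-sink.channel = memory-channel
    agent.sinks.log-sink.type = logger

    agent.sinks.hdfs-sink.channel = memory-channel
    agent.sinks.hdfs-sink.type = hdfs
    # Placeholder path; the original hdfs-sink settings are cut off above.
    agent.sinks.hdfs-sink.hdfs.path = hdfs://<namenode>:8020/flume/auth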

Flume HDFS Sink generates lots of tiny files on HDFS

Submitted by 早过忘川 on 2019-12-02 07:49:27
I have a toy setup sending log4j messages to HDFS using Flume. I'm not able to configure the HDFS sink to avoid many small files. I thought I could configure the HDFS sink to create a new file every time the file size reaches 10 MB, but it is still creating files of around 1.5 KB. Here is my current Flume config:

    a1.sources=o1
    a1.sinks=i1
    a1.channels=c1
    #source configuration
    a1.sources.o1.type=avro
    a1.sources.o1.bind=0.0.0.0
    a1.sources.o1.port=41414
    #sink config
    a1.sinks.i1.type=hdfs
    a1.sinks.i1.hdfs.path=hdfs://localhost:8020/user/myName/flume/events
    #never roll-based on time
    a1.sinks.i1.hdfs …
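For reference, the HDFS sink rolls files on time, size and event count, and the defaults for all three are small (30 s, ~1 KB, 10 events), which is a common reason for 1.5 KB files. A sketch of a size-only rolling sink block, continuing the a1/i1 names from the snippet above; the 10 MB value is illustrative:

    # Sketch: disable time- and count-based rolling, roll on ~10 MB only.
    a1.sinks.i1.type = hdfs
    a1.sinks.i1.hdfs.path = hdfs://localhost:8020/user/myName/flume/events
    a1.sinks.i1.hdfs.rollInterval = 0
    a1.sinks.i1.hdfs.rollCount = 0
    a1.sinks.i1.hdfs.rollSize = 10485760

If hdfs.idleTimeout is set, it also closes files after a period of inactivity, which can produce small files on its own.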

Is it possible to write Flume headers to HDFS sink and drop the body?

Submitted by 痞子三分冷 on 2019-12-02 00:19:43
The text_with_headers serializer (an HDFS sink serializer) allows you to save the Flume event headers rather than discarding them. The output format consists of the headers, followed by a space, then the body payload. We would like to drop the body and retain only the headers. For the HBase sink, the "RegexHbaseEventSerializer" allows us to transform the events, but I am unable to find such a provision for the HDFS sink. You can set the serializer property to header_and_text, which outputs both the headers and the body. For example:

    agent.sinks.my-hdfs-sink.type = hdfs
    agent.sinks.my-hdfs-sink.hdfs …
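The example above is cut off; below is a fuller sketch of what it presumably continues into, with a placeholder HDFS path. Note that header_and_text still writes the body after the headers, so dropping the body entirely would likely require a custom EventSerializer; that part is an assumption, not a documented option.

    # Sketch: header_and_text writes the headers followed by the event body as text.
    agent.sinks.my-hdfs-sink.type = hdfs
    agent.sinks.my-hdfs-sink.hdfs.path = hdfs://<namenode>/flume/events
    agent.sinks.my-hdfs-sink.hdfs.fileType = DataStream
    agent.sinks.my-hdfs-sink.serializer = header_and_text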

Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming

Submitted by 被刻印的时光 ゝ on 2019-11-30 23:46:40
There is a tiny problem when I try Cloudera 5.4.2. Based on the article Apache Flume - Fetching Twitter Data (http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm), it fetches tweets using Flume and the Twitter streaming API for data analysis. Everything goes fine: create the Twitter app, create the directory on HDFS, configure Flume, start fetching data, create a schema on top of the tweets. Then, here is the problem. The Twitter streaming source converts tweets to Avro format and sends the Avro events to the downstream HDFS sinks, but when the Hive table backed by Avro loads the data, I get an error message saying the Avro block size is invalid or too large.
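Not a definitive fix: this error generally means the file Hive reads is not a well-formed Avro container, which can happen either because the Twitter source in use emits something other than Avro (the Cloudera and Apache Twitter sources differ here) or because the HDFS sink wraps the bytes in its default SequenceFile format. A sketch of the sink lines often paired with the Avro-emitting org.apache.flume.source.twitter.TwitterSource, with placeholder names and path:

    # Sketch: write the raw event bytes (Avro from the Twitter source) as-is,
    # instead of the default SequenceFile wrapper; names and path are placeholders.
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://<namenode>/user/flume/tweets
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream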
