flume-ng

Increase Flume MaxHeap

Submitted by 瘦欲@ on 2021-02-11 12:50:33
Question: Good afternoon, I'm having trouble increasing the heap size for Flume. As a result, I get:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

I've increased the heap defined in "flume-env.sh" as well as in Hadoop/YARN. No luck. One thing to note: on starting Flume, the exec (ProcessBuilder?) seems to be defining the heap as 20 MB. Any ideas on how to override it?

    Info: Including Hadoop libraries found via (/usr/local/hadoop/bin/hadoop) for HDFS access
    Info: Including Hive
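
The 20 MB heap is typically the fallback -Xmx20m set inside the flume-ng launcher script itself; flume-env.sh is only sourced when the agent is started with --conf pointing at the directory that contains it. A minimal sketch, assuming the stock Apache Flume layout and an agent named a1 (paths and sizes are illustrative, not recommendations):

    # conf/flume-env.sh (assumed location)
    export JAVA_OPTS="-Xms512m -Xmx2048m"

    # start the agent so that conf/flume-env.sh is actually sourced
    bin/flume-ng agent --conf conf --conf-file conf/agent.conf --name a1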

Flume HDFS sink: Remove timestamp from filename

Submitted by 試著忘記壹切 on 2020-08-06 15:12:50
Question: I have configured a Flume agent for my application, where the source is spooldir and the sink is HDFS, and I am able to collect files in HDFS. The agent configuration is:

    agent.sources = src-1
    agent.channels = c1
    agent.sinks = k1
    agent.sources.src-1.type = spooldir
    agent.sources.src-1.channels = c1
    agent.sources.src-1.spoolDir = /home/Documents/id/
    agent.sources.src-1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
    agent.sources.src-1.fileHeader = true
    agent.channels.c1.type = file
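
The HDFS sink builds file names from a prefix, a generated epoch-style counter, and an optional suffix, and no stock property removes the counter; what configuration can control is the prefix and suffix, for example reusing the original file name carried in a header. A minimal sketch, assuming Flume 1.6+ and the agent/source/sink names from the excerpt (the counter still appears unless the sink is customized):

    # have the spooldir source put the original file name into a header
    agent.sources.src-1.basenameHeader = true
    agent.sources.src-1.basenameHeaderKey = basename

    # use that header as the HDFS file prefix; the suffix is optional
    agent.sinks.k1.hdfs.filePrefix = %{basename}
    agent.sinks.k1.hdfs.fileSuffix = .log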

Streaming Kafka Messages to MySQL Database

Submitted by 北城以北 on 2019-12-25 19:39:09
Question: I want to write Kafka messages to a MySQL database. There is an example in this link; in that example, Apache Flume is used to consume the messages and write them to MySQL. I'm using the same code, and when I run the flume-ng agent, the event always becomes null. My flume.conf.properties file is:

    agent.sources = kafkaSrc
    agent.channels = channel1
    agent.sinks = jdbcSink
    agent.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
    agent.channels.channel1.brokerList = localhost:9092
    agent.channels
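
When plain (non-Flume) producers write to the topic, a Kafka channel that still expects Flume-serialized events hands the sink events whose body does not parse, which is one common reason the event shows up as null. A minimal sketch of the channel side only, assuming an older Flume release (the excerpt uses brokerList-style properties) and an illustrative topic name kafka-mysql; the custom JDBC sink from the linked example is not reproduced here:

    agent.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
    agent.channels.channel1.brokerList = localhost:9092
    agent.channels.channel1.topic = kafka-mysql
    agent.channels.channel1.zookeeperConnect = localhost:2181
    # messages were produced as plain text, not as Flume Avro events
    agent.channels.channel1.parseAsFlumeEvent = false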

org.apache.kafka.common.errors.RecordTooLargeException in Flume Kafka Sink

Submitted by 二次信任 on 2019-12-24 08:49:00
Question: I am trying to read data from a JMS source and push it into a Kafka topic. While doing that, after a few hours I observed that the push frequency to the Kafka topic became almost zero, and after some initial analysis I found the following exception in the Flume logs:

    28 Feb 2017 16:35:44,758 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.SinkRunner$PollingRunner.run:158) - Unable to deliver event. Exception follows.
    org.apache.flume.EventDeliveryException: Failed to publish
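
RecordTooLargeException means a produced record exceeded the producer's max.request.size (or the broker's message.max.bytes), so the limits have to be raised on both sides or the events made smaller. A minimal sketch, assuming Flume 1.7+ where Kafka sink producer properties are passed through with a kafka.producer. prefix, an agent a1 with sink k1, and an illustrative 5 MB limit:

    # producer side (Flume Kafka sink)
    a1.sinks.k1.kafka.producer.max.request.size = 5242880

    # broker side (server.properties) and/or the topic's max.message.bytes must allow at least the same size
    # message.max.bytes = 5242880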

Reading Flume spoolDir in parallel

Submitted by 旧街凉风 on 2019-12-22 10:51:08
Question: Since I'm not allowed to set up Flume on the prod servers, I have to download the logs, put them in a Flume spoolDir, and have a sink consume from the channel and write to Cassandra. Everything is working fine. However, as I have a lot of log files in the spoolDir and the current setup only processes one file at a time, it's taking a while. I want to be able to process many files concurrently. One way I thought of is to use the spoolDir but distribute the files into 5-10 different
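
A spooling directory source reads its directory single-threaded, so parallelism usually comes from running several spooldir sources, each watching its own subdirectory, and fanning them into the same channel (or into several channels, each with its own sink). A minimal sketch, assuming an agent a1 and illustrative paths; the Cassandra sink configuration is left out:

    a1.sources = s1 s2 s3
    a1.channels = c1

    a1.sources.s1.type = spooldir
    a1.sources.s1.spoolDir = /data/spool/part1
    a1.sources.s1.channels = c1

    a1.sources.s2.type = spooldir
    a1.sources.s2.spoolDir = /data/spool/part2
    a1.sources.s2.channels = c1

    a1.sources.s3.type = spooldir
    a1.sources.s3.spoolDir = /data/spool/part3
    a1.sources.s3.channels = c1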

Flume HDFS Sink generates lots of tiny files on HDFS

Submitted by 对着背影说爱祢 on 2019-12-20 03:53:41
Question: I have a toy setup sending log4j messages to HDFS using Flume. I'm not able to configure the HDFS sink to avoid many small files. I thought I could configure the HDFS sink to create a new file every time the file size reaches 10 MB, but it is still creating files of around 1.5 KB. Here is my current Flume config:

    a1.sources = o1
    a1.sinks = i1
    a1.channels = c1
    # source configuration
    a1.sources.o1.type = avro
    a1.sources.o1.bind = 0.0.0.0
    a1.sources.o1.port = 41414
    # sink config
    a1.sinks.i1.type = hdfs
    a1.sinks.i1
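
The HDFS sink rolls a file whenever any of its roll triggers fires, and the defaults (rollInterval 30 s, rollSize 1024 bytes, rollCount 10 events) all fire long before 10 MB is reached, so size-based rolling only takes effect once the other two triggers are disabled. A minimal sketch for the sink named in the excerpt, with illustrative values:

    a1.sinks.i1.hdfs.rollSize = 10485760
    # disable the time- and event-count-based triggers so only rollSize applies
    a1.sinks.i1.hdfs.rollInterval = 0
    a1.sinks.i1.hdfs.rollCount = 0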

Is it possible to write Flume headers to HDFS sink and drop the body?

Submitted by 試著忘記壹切 on 2019-12-20 03:41:10
Question: The text_with_headers serializer (HDFS sink serializer) allows the Flume event headers to be saved rather than discarded. The output format consists of the headers, followed by a space, then the body payload. We would like to drop the body and retain only the headers. For the HBase sink, the RegexHbaseEventSerializer allows us to transform the events, but I am unable to find such a provision for the HDFS sink.

Answer 1: You can set the serializer property to header_and_text, which outputs both
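
Since header_and_text still writes the body, one way to end up with headers only is to blank the body before it reaches the sink, for instance with the search-and-replace interceptor, so the serializer emits the headers followed by an empty payload. A minimal sketch, assuming an agent a1 with source r1 and HDFS sink k1; whether an empty replaceString is accepted by your Flume version is worth verifying:

    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.serializer = header_and_text

    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = search_replace
    # match the whole body and replace it with nothing
    a1.sources.r1.interceptors.i1.searchPattern = .*
    a1.sources.r1.interceptors.i1.replaceString =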

Unable to correctly load twitter avro data into hive table

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-17 20:27:35
Question: Need your help! I am trying a trivial exercise of getting data from Twitter and then loading it into Hive for analysis. Though I am able to get data into HDFS using Flume (using the Twitter 1% firehose source) and am also able to load the data into a Hive table, I am unable to see all the columns I expected to be there in the Twitter data, such as user_location, user_description, user_friends_count, user_description, user_statuses_count. The schema derived from Avro only contains two columns
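
With an Avro-backed Hive table, the columns are taken from the Avro schema of the files, so the table only ever exposes whatever fields that schema defines; the missing user_* columns only appear once the data is written with, and the table declared against, a schema that actually carries them. A minimal sketch of such an external table, assuming illustrative HDFS paths and a hand-maintained schema file tweets.avsc:

    CREATE EXTERNAL TABLE tweets
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '/user/flume/tweets'
    TBLPROPERTIES ('avro.schema.url' = 'hdfs:///user/flume/schemas/tweets.avsc');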

channel lock error while configuring flume's multiple sources using FILE channels

Submitted by 五迷三道 on 2019-12-13 08:07:07
Question: Configuring multiple sources for an agent is throwing a lock error when using FILE channels. Below is my config file:

    a1.sources = r1 r2
    a1.sinks = k1 k2
    a1.channels = c1 c3
    # sources
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 4444
    a1.sources.r2.type = exec
    a1.sources.r2.command = tail -f /opt/gen_logs/logs/access.log
    # sinks
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume201
    a1.sinks.k1.hdfs.filePrefix = netcat-
    a1.sinks.k1.rollInterval = 100
    a1.sinks.k1.hdfs.fileType
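
A file channel keeps its checkpoint and data directories under ~/.flume/file-channel by default, so two file channels in the same agent that both rely on the defaults try to lock the same directories, which is the usual cause of this lock error; each channel needs its own directories. A minimal sketch for the two channels from the excerpt, with illustrative paths:

    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /var/flume/c1/checkpoint
    a1.channels.c1.dataDirs = /var/flume/c1/data

    a1.channels.c3.type = file
    a1.channels.c3.checkpointDir = /var/flume/c3/checkpoint
    a1.channels.c3.dataDirs = /var/flume/c3/data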

Apache Flume - send only new file contents

Submitted by 家住魔仙堡 on 2019-12-12 06:09:07
Question: I am a very new user to Flume, please treat me as an absolute noob. I am having a minor issue configuring Flume for a particular use case and was hoping you could assist. Note that I am not using HDFS, which is why this question is different from others you may have seen on forums. I have two virtual machines (VMs) connected to each other through an internal network on Oracle VirtualBox. My goal is to have one VM watch a particular directory that will only ever have one file in it. When the
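
For picking up only the newly appended contents of a file that stays in place, the Taildir source (Flume 1.7+) is the usual fit: it tracks its read position in a JSON file and ships only new data, which the watching agent can forward to the second VM over an Avro sink/source pair. A minimal sketch of the watching side, assuming an agent a1 and illustrative paths, hostnames, and ports:

    a1.sources.r1.type = TAILDIR
    a1.sources.r1.positionFile = /var/flume/taildir_position.json
    a1.sources.r1.filegroups = f1
    a1.sources.r1.filegroups.f1 = /watched/dir/.*
    a1.sources.r1.channels = c1

    # forward to the agent on the other VM
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = 192.168.56.102
    a1.sinks.k1.port = 4545
    a1.sinks.k1.channel = c1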