flume

Using an HDFS Sink and rollInterval in Flume-ng to batch up 90 seconds of log information

百般思念, submitted on 2019-12-03 03:51:55
I am trying to use Flume-ng to grab 90 seconds of log information and put it into a file in HDFS. I have Flume watching the log file via an exec source running tail, but it is creating a file every 5 seconds instead of every 90 seconds as I am trying to configure. My flume.conf is as follows:

# example.conf: A single-node Flume configuration
# Name the components on this agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Describe/configure source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -f /home/cloudera/LogCreator/fortune
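The HDFS sink rolls a file on whichever trigger fires first, and its defaults (rollInterval=30, rollSize=1024 bytes, rollCount=10 events) fire long before 90 seconds elapse. A minimal sketch of the sink side, reusing the agent and sink names above with an illustrative HDFS path; zeroing the other triggers leaves only the 90-second timer:

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost:8020/user/flume/logs
agent1.sinks.sink1.hdfs.rollInterval = 90
# 0 disables size- and count-based rolling, so only the timer rolls files
agent1.sinks.sink1.hdfs.rollSize = 0
agent1.sinks.sink1.hdfs.rollCount = 0
agent1.sinks.sink1.channel = channel1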

Flume - Can an entire file be considered an event in Flume?

Anonymous (unverified), submitted on 2019-12-03 02:54:01
I have a use case where I need to ingest files from a directory into HDFS. As a POC, I used simple Directory Spooling in Flume where I specified the source, sink and channel and it works fine. The disadvantage is that I would have to maintain multiple directories for multiple file types that go into distinct folders in order to get greater control over file sizes and other parameters, while making configuration repetitive, but easy. As an alternative, I was advised to use regex interceptors where multiple files would reside in a single
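On the title question itself: the spooling directory source can treat an entire file as one event by swapping in a blob deserializer. A minimal sketch, assuming the morphline-solr-sink jars (which ship BlobDeserializer) are on the Flume classpath and using hypothetical directory names:

agent1.sources = src1
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/spool/flume/incoming
# read each whole file as a single event instead of one event per line
agent1.sources.src1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent1.sources.src1.deserializer.maxBlobLength = 100000000
agent1.sources.src1.channels = channel1

The caveat is that each file must then fit in memory as one event.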

What's the difference between Flume and Sqoop?

一个人想着一个人, submitted on 2019-12-03 01:30:53
Both Flume and Sqoop are meant for data movement, so what is the difference between them? Under what conditions should I use Flume or Sqoop?

From http://flume.apache.org/: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Flume helps collect data from a variety of sources, such as logs, JMS, and directories. Multiple Flume agents can be configured to collect high volumes of data, and it scales horizontally.

From http://sqoop.apache.org/: Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data
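To make the contrast concrete: Sqoop moves bulk data between relational databases and Hadoop in discrete, batch-style jobs, while Flume streams event data continuously. A hypothetical Sqoop import (connection string, credentials, table, and paths are illustrative only):

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl --password-file /user/etl/.pw \
  --table orders \
  --target-dir /data/warehouse/orders \
  -m 4

By contrast, the Flume configurations elsewhere on this page tail log files or consume Kafka topics as an always-on stream.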

Flume HDFS sink keeps rolling small files

Anonymous (unverified), submitted on 2019-12-03 01:13:01
I'm trying to stream Twitter data into HDFS using Flume and this: https://github.com/cloudera/cdh-twitter-example/ Whatever I try, it keeps creating files in HDFS that range in size from 1.5 kB to 15 kB, where I would like to see large files (64 MB). Here is the agent configuration:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =
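With the HDFS sink defaults, files roll after 30 seconds or about 1 kB, which matches the tiny files described. A sketch of the sink settings to aim for roughly 64 MB files, following the agent naming above with an illustrative path:

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.rollSize = 67108864
# disable count- and time-based rolling so only size triggers a new file
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0

Note that agent restarts can still leave behind smaller, partially filled files.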

Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming

Anonymous (unverified), submitted on 2019-12-03 01:12:01
There is a small problem when I try Cloudera 5.4.2, based on this article: Apache Flume - Fetching Twitter Data, http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm. It fetches tweets using Flume and Twitter streaming for data analysis. Everything goes fine at first: create the Twitter app, create the directory on HDFS, configure Flume, start fetching data, and create a schema on top of the tweets. Then comes the problem. Twitter streaming converts tweets to Avro format and sends Avro events to the downstream HDFS sinks; when the Hive table backed by
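For reference, a Hive table over Flume-written Avro files is typically declared with the Avro SerDe; a minimal sketch with hypothetical location and schema paths. An "Avro block size is invalid or too large" error usually means the files under LOCATION are not valid Avro containers, so the sink's output files are worth inspecting first:

CREATE EXTERNAL TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/flume/tweets'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/flume/twitter.avsc');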

Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2

Anonymous (unverified), submitted on 2019-12-03 00:59:01
While compiling the Maven project, the following error occurred:

[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ spark-streaming-flume-sink_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
[INFO] Using incremental compilation
[INFO] Compiling 6 Scala sources and 3 Java sources to /home/gorlec/Desktop/test/external/flume-sink/target/scala-2.10/classes...
[ERROR] /home/gorlec/Desktop/test/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink
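The [ERROR] line is cut off above, so the actual cause is unknown; the Zinc warning by itself is harmless. If the incremental-compilation path is suspected, one illustrative workaround is forcing a plain full compile in the plugin configuration (a sketch only; version and placement assume the plugin declaration already present in the pom):

<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <version>3.2.2</version>
  <configuration>
    <!-- bypass the incremental/Zinc compilation path entirely -->
    <recompileMode>all</recompileMode>
  </configuration>
</plugin>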

Flume Explained in Detail

Anonymous (unverified), submitted on 2019-12-03 00:37:01
This article gratefully acknowledges its reference documents; it reorganizes them on the basis of my own understanding and shares the result with anyone who needs it.

1. Introduction to Flume
Apache Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive volumes of log data, used to efficiently collect, aggregate, and move large amounts of log data from many different sources into a centralized data store (such as text files, HDFS, or HBase). Its use is not limited to log aggregation: because data sources are customizable (Avro, Thrift, Syslog, and Netcat are built in), Flume can transport large volumes of event data, including but not limited to network traffic data, social-media-generated data, email messages, and almost any conceivable data source.

2. The Flume core
How it works: Flume's data flow is carried end to end by events (Event). The Event is Flume's basic unit of data; it carries the log payload (as a byte array) together with header information. Events are generated by the Source from data outside the Agent; when the Source captures an event it applies a specific formatting, then pushes the event into one or more Channels. You can think of a Channel as a buffer that holds events until a Sink finishes processing them. The Sink is responsible for persisting the log or pushing the event on to another Source.

A few concepts:
Client: produces the data; runs in an independent thread.
Event: a unit of data, composed of a message header and a message body. (Events can be log records, Avro objects, etc.)
Flow: an abstraction of an Event's migration from a source point to a destination point.
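To make the Source → Channel → Sink pipeline concrete, here is a minimal single-agent sketch (component names and the netcat port are illustrative):

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Source: listen on a TCP port; each received line becomes one Event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Channel: in-memory buffer that holds Events until the Sink consumes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
# Sink: write Events to the agent's log
a1.sinks.k1.type = logger
# Wire the pieces together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1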

Importing data into HBase with Flume

Anonymous (unverified), submitted on 2019-12-03 00:32:02
1. Copy the jars from HBase's lib directory into Flume's lib directory.
2. In HBase, create the table that will store the data:
hbase(main):002:0> create 'test_idoall_org', 'uid', 'name'
3. Create the Flume configuration file vi.conf:

a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data.txt
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type = hbase
a1.sinks.k1.table = test_idoall_org
a1.sinks.k1.columnFamily = name
a1.sinks.k1.column = idoall
a1.sinks.k1.serializer = org.apache.flume.sink.hbase
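A sketch of running the agent and verifying the writes, assuming the file and table names above:

flume-ng agent -n a1 -c conf -f vi.conf -Dflume.root.logger=INFO,console
echo "hello idoall" >> /home/hadoop/data.txt
hbase shell
hbase(main):003:0> scan 'test_idoall_org'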

Flume Kafka Source, Regex Interceptor, and HDFS Sink

Anonymous (unverified), submitted on 2019-12-03 00:29:01
The Kafka Source, regex interceptors, and the HDFS Sink are commonly used together in Flume. This post summarizes the points that need attention and lands data in HDFS partitioned by event time and event type.

Kafka Source:
# source type
agent.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
# list of Kafka brokers
agent.sources.s1.kafka.bootstrap.servers = localhost:9092
# Kafka topic(s) to consume
agent.sources.s1.kafka.topics = testTopic3
# consumer group id
agent.sources.s1.kafka.consumer.group.id = consumer_testTopic3
# interval (ms) for auto-committing offsets
agent.sources.s1.kafka.consumer.auto.commit.interval.ms = 60000

1. By default, Flume adds three attributes to the Event header of every Event read from Kafka: partition, topic, and timestamp, e.g. partition=2, topic=testTopic3,
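Those headers can drive the HDFS Sink's path escapes, which is one way to get the time/type partitioning the post describes. A sketch of the sink side with an illustrative path and channel name:

agent.sinks.k1.type = hdfs
# %{topic} expands to the 'topic' header; %Y%m%d is resolved from the 'timestamp' header
agent.sinks.k1.hdfs.path = hdfs://nameservice1/flume/%{topic}/%Y%m%d
# false = use the timestamp header set upstream rather than the sink's local clock
agent.sinks.k1.hdfs.useLocalTimeStamp = false
agent.sinks.k1.hdfs.fileType = DataStream
agent.sinks.k1.channel = c1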

Flume-related documentation

Anonymous (unverified), submitted on 2019-12-03 00:27:02
(The already-modified jar.) If it is lost, the location that needs modifying is:
\apache-flume-1.7.0-src\flume-ng-sources\flume-taildir-source\src\main\java\org\apache\flume\source\taildir\ReliableTaildirEventReader.java
Method: loadPositionFile