flume

How to use flume for uploading zip files to hdfs sink

久未见 submitted on 2019-12-08 07:44:49

Question: I am new to Flume. My Flume agent has an HTTP server as its source, from which it receives zip files (compressed XML files) at regular intervals. These zip files are very small (less than 10 MB), and I want the extracted contents of the zip files to go into the HDFS sink. Please share some ideas on how to do this. Do I have to write a custom interceptor?

Answer 1: Flume will try to read your files line by line, unless you configure a specific deserializer. A deserializer lets you control how the file is parsed and split into events. You could of course follow the example of the blob deserializer, which is designed for PDFs and
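As a rough illustration of the deserializer approach the answer mentions (a sketch only, not the poster's setup: it assumes a spooling directory source, the BlobDeserializer that ships with Flume's morphline Solr sink module, and made-up agent/component names and paths):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
a1.sources.r1.spoolDir = /var/flume/incoming-zips
# hand the whole file to Flume as one event body instead of splitting it line by line
a1.sources.r1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
a1.sources.r1.deserializer.maxBlobLength = 20000000

Note that this only keeps each zip file intact as a single event; actually unzipping the XML before it reaches the HDFS sink would still need a custom deserializer or interceptor.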

Flume - Is there a way to store avro event (header & body) into hdfs?

邮差的信 submitted on 2019-12-08 06:03:43

Question: New to Flume... I'm receiving Avro events and storing them into HDFS. I understand that by default only the body of the event is stored in HDFS. I also know there is an avro_event serializer, but I don't know what this serializer actually does or how it affects the final output of the sink. Also, I can't figure out how to dump the whole event into HDFS while preserving its header information. Do I need to write my own serializer?

Answer 1: As it turns out, the avro_event serializer does store both
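For reference, this is roughly what an HDFS sink configured with the built-in avro_event serializer looks like (a sketch with assumed agent/component names and an assumed HDFS path, not taken from the original post):

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
# DataStream lets the serializer, not the sink, decide the on-disk format
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# avro_event writes each Flume event, headers map and body, as an Avro record
a1.sinks.k1.serializer = avro_event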

How to handle multiline log entries in Flume

不羁的心 submitted on 2019-12-08 03:32:28

I have just started playing with Flume. I have a question on how to handle log entries that span multiple lines as a single event, like stack traces during error conditions. For example, treat the following as a single event rather than one event per line:

2013-04-05 05:00:41,280 ERROR (ClientRequestPool-PooledExecutionEngine-Id#4 ) [com.ms.fw.rexs.gwy.api.service.AbstractAutosysJob] job failed for 228794
java.lang.NullPointerException
at com.ms.fw.rexs.core.impl.service.job.ReviewNotificationJobService.createReviewNotificationMessageParameters(ReviewNotificationJobService.java:138)
....

I have

Can I extend Flume sink to make it write different data to multiple channels?

余生颓废 submitted on 2019-12-08 02:04:15

Question: A follow-up question to my previous question about Flume data flows. I want to process events and send the extracted data onward. I'd like to accept large events, like zipped HTML over 5 KB, parse them, and put many slim messages, like the URLs found in the pages, onto another channel, and also some page metrics onto yet another one. Since parsing pages is resource-consuming, I'd rather not replicate messages to different processors for these tasks, both of which require parsing the HTML and building a DOM in
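The excerpt cuts off before the answer, but for context, routing different data to different channels is something Flume normally does on the source side with a multiplexing channel selector keyed on an event header (set once by whatever component does the parsing), rather than inside a sink. A minimal sketch with made-up header and channel names:

a1.sources = r1
a1.channels = urls metrics
a1.sources.r1.channels = urls metrics
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = record.type
a1.sources.r1.selector.mapping.url = urls
a1.sources.r1.selector.mapping.metric = metrics
a1.sources.r1.selector.default = urls
# the parser sets the record.type header once per event, so the HTML is parsed
# a single time and the resulting events fan out to channels by header value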

A comparison of open-source log systems: Scribe, Chukwa, Kafka, and Flume

血红的双手。 submitted on 2019-12-07 13:19:47

1. Background

Many companies' platforms generate large volumes of logs every day (usually streaming data such as search-engine page views and queries). Processing these logs requires a dedicated log system, and in general such systems need the following characteristics:
(1) They bridge application systems and analysis systems while decoupling the two.
(2) They support both near-real-time online analysis and offline analysis systems such as Hadoop.
(3) They are highly scalable: when data volume grows, they can scale out horizontally by adding nodes.
This article compares today's open-source log systems, including Facebook's Scribe, Apache's Chukwa, LinkedIn's Kafka, and Cloudera's Flume, in terms of design architecture, load balancing, scalability, and fault tolerance.

2. Facebook's Scribe

Scribe is Facebook's open-source log collection system and is heavily used inside Facebook. It can collect logs from a variety of log sources and store them in a central storage system (NFS, a distributed file system, etc.) for centralized statistical analysis, providing a scalable, highly fault-tolerant solution for "distributed collection, centralized processing" of logs.
Its most important feature is good fault tolerance: when the back-end storage system crashes, Scribe writes the data to local disk, and once the storage system recovers, Scribe reloads the logs into it.

Architecture: Scribe's architecture is fairly simple, consisting of three parts: the scribe agent, scribe itself, and the storage system.
(1) scribe

Invalid hostname error when connecting to S3 sink when the secret key contains a forward slash

给你一囗甜甜゛ submitted on 2019-12-07 10:21:49

Question: I have a forward slash in my AWS secret key. When I try to connect to the S3 sink I get:

Caused by: java.lang.IllegalArgumentException: Invalid hostname in URI s3://xxxx:xxxx@jelogs/je.1359961366545 at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:41)

When I encode the forward slash as %2F, I get "The request signature we calculated does not match the signature you provided. Check your key and signing method." How should I encode my secret key?

Answer 1: samthebest's solution works, you just

A complete data-collection pipeline with Play!, Akka, and Flume

徘徊边缘 submitted on 2019-12-06 18:59:15

Preface

Big data is booming these days, and downstream analysis of user behavior, user preferences, and the like is just as hot. This small project implements a complete back-end data-collection pipeline.

Overall flow and technologies used:
Play! acts as the web server, with interfaces written to RESTful conventions (the client instruments events in advance, then calls the interface to upload data).
Records received by the Play! interface (as JSON) are processed and first stored in a concurrentQueue.
When Play! starts, it launches a schedulable Akka actor that periodically has child actors poll the data in the queue.
A wrapper around Flume's RPC client sends the data to a designated port.
The Flume source receives the data, redirects it according to its configuration, and sinks it to the console.

3. Back-end implementation

3.1 Writing the interface

The interface follows RESTful conventions. First, define the route in Play!'s conf/routes file:

# run log
POST /events/runlogs controllers.RunLogs.create()

Then write the controller:

public static Result create() {
    JsonNode js = request().body().asJson();
    RunLog.create(js);
    // return ok anyway
    return ok();
}

Then the model: public
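The Flume side of this flow is not shown at this point in the article, but the behavior described (accept data over Flume's RPC on a fixed port, then sink it to the console) corresponds roughly to an Avro source feeding a logger sink. A minimal sketch, with the agent name, port, and channel sizing all assumed rather than taken from the article:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Avro source: the endpoint Flume's RPC client sends events to
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# the logger sink prints incoming events to the agent's console/log output
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1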

Unable to download data from Twitter through Flume

心已入冬 submitted on 2019-12-06 17:01:15

Question:

bin/flume-ng agent -n TwitterAgent --conf ./conf/ -f conf/flume-twitter.conf -Dflume.root.logger=DEBUG,console

When I run the above command it generates the following error:

2016-05-06 13:33:31,357 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] 404:The URI requested is invalid or the resource requested, such as a user, does not exist. Unknown URL. See Twitter Streaming API documentation at http://dev.twitter.com
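The poster's conf/flume-twitter.conf is not shown, but for context this is roughly the shape of a Twitter source configuration in Flume (a sketch using the experimental TwitterSource bundled with Flume; the agent name matches the -n flag above, and the credential values are placeholders):

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
# OAuth credentials from the Twitter developer console (placeholders here)
TwitterAgent.sources.Twitter.consumerKey = YOUR_CONSUMER_KEY
TwitterAgent.sources.Twitter.consumerSecret = YOUR_CONSUMER_SECRET
TwitterAgent.sources.Twitter.accessToken = YOUR_ACCESS_TOKEN
TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
TwitterAgent.channels.MemChannel.type = memory

Note that the 404 in the log above is reported by the twitter4j client itself, so the problem is not necessarily in this file.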

Flume configuration file

我的未来我决定 submitted on 2019-12-06 14:33:42

1) Create the kafka-flume-hdfs.conf file in the /bd/flume-1.7/conf directory on elk-03:

[hadoop@elk-03 conf]$ vim kafka-flume-hdfs.conf

2) Put the following content in the file:

## component definitions
a1.sources=r1 r2
a1.channels=c1 c2
a1.sinks=k1 k2

## source1
## source data from the kafka "start" topic
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = elk-01:9092,elk-02:9092,elk-03:9092
a1.sources.r1.kafka.zookeeperConnect = elk-01:2181,elk-02:2181,elk-03:2181
a1.sources.r1.kafka.topics=topic_start

## source2
## source data from the kafka "event" topic
a1.sources.r2.type = org