flume

Cannot run Flume because of JAR conflict

Submitted by Deadly on 2019-12-10 17:32:37
Question: I've installed Flume and Hadoop manually (that is, not from CDH) and I'm trying to run the Twitter example from Cloudera. In the apache-flume-1.5.0-SNAPSHOT-bin directory, I start the agent with the following command: bin/flume-ng agent -c conf -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent. My conf/twitter.conf file uses the logger as the sink, and conf/flume-env.sh adds flume-sources-1.0-SNAPSHOT.jar, which contains the definition of the Twitter source, to the CLASSPATH.
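For reference, a minimal sketch of the setup being described; the source class and property names follow Cloudera's published Twitter example, while the jar path and credential placeholders are illustrative:

    # conf/flume-env.sh: put the custom source jar on Flume's classpath
    FLUME_CLASSPATH="/path/to/flume-sources-1.0-SNAPSHOT.jar"

    # conf/twitter.conf: Twitter source feeding a logger sink via a memory channel
    TwitterAgent.sources = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks = LoggerSink
    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
    TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
    TwitterAgent.sources.Twitter.accessToken = <your access token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
    TwitterAgent.sinks.LoggerSink.type = logger
    TwitterAgent.sinks.LoggerSink.channel = MemChannel
    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000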

HDFS IO error org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

Submitted by 断了今生、忘了曾经 on 2019-12-10 12:01:48
Question: I am using Flume 1.6.0 in one virtual machine and Hadoop 2.7.1 in another. When I send Avro events to Flume 1.6.0 and it tries to write to the Hadoop 2.7.1 HDFS, the following exception occurs: (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:455)] HDFS IO error org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.call
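The version numbers in the message point at the usual cause: IPC client version 4 is what the Hadoop 1.x client libraries speak, while IPC version 9 is what Hadoop 2.x servers expect, so the Flume agent is carrying Hadoop 1.x jars. A common remedy (paths below are illustrative) is to put the Hadoop 2.7.1 client jars on Flume's classpath in place of any bundled 1.x jars:

    # conf/flume-env.sh on the Flume VM, assuming Hadoop 2.7.1 is unpacked locally
    HADOOP_HOME=/opt/hadoop-2.7.1
    FLUME_CLASSPATH="$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*"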

Flume Part 3: Writing JSON Data from Flume into CarbonData (flume-carbondata-sink)

Submitted by 。_饼干妹妹 on 2019-12-10 11:10:38
Flume Part 3: Writing JSON data from Flume into CarbonData (flume-carbondata-sink). The same approach also works for non-JSON data: an interceptor can assemble non-JSON data into a single comma-separated string and send that instead. Without further ado, straight to the substance. I. The custom interceptor: 1. Interceptor requirement: create a brand-new project and package it on its own, so that every Flume interceptor is built as its own separate artifact; that way, modifying one interceptor never affects other Flume jobs.

    <properties>
      <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      <maven.compiler.source>1.7</maven.compiler.source>
      <maven.compiler.target>1.7</maven.compiler.target>
      <scala.version>2.10.4</scala.version>
      <flume.version>1.8.0</flume.version>
    </properties>
    <dependencies>
      <dependency> <
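To make the packaging requirement concrete, a minimal skeleton of what such a standalone interceptor project contains; the class name and pass-through body are illustrative, not the post's actual code:

    package com.example;

    import java.util.List;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    // Minimal Flume interceptor skeleton: rewrites each event body in place.
    public class JsonToCsvInterceptor implements Interceptor {

        @Override
        public void initialize() { }

        @Override
        public Event intercept(Event event) {
            // Real logic would parse the JSON body and rebuild it as a
            // comma-separated string; here the body just passes through.
            String body = new String(event.getBody());
            event.setBody(body.getBytes());
            return event;
        }

        @Override
        public List<Event> intercept(List<Event> events) {
            for (Event e : events) {
                intercept(e);
            }
            return events;
        }

        @Override
        public void close() { }

        // Flume instantiates interceptors through a nested Builder class.
        public static class Builder implements Interceptor.Builder {
            @Override
            public Interceptor build() {
                return new JsonToCsvInterceptor();
            }

            @Override
            public void configure(Context context) { }
        }
    }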

Can Apache Sqoop and Flume be used interchangeably?

Submitted by 点点圈 on 2019-12-10 09:40:53
Question: I am new to big data. From some of the answers to "What's the difference between Flume and Sqoop?", both Flume and Sqoop can pull data from a source and push it to Hadoop. Can anyone specify exactly where Flume is used and where Sqoop is? Can both be used for the same tasks? Answer 1: Flume and Sqoop are designed to work with different kinds of data sources. Sqoop works with any RDBMS that supports JDBC connectivity, while Flume works well with streaming data sources
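To make the split concrete, a typical pairing; the connection string, table, and paths are purely illustrative. Sqoop runs a one-shot batch import from a JDBC source, while Flume continuously collects from a streaming source:

    # Sqoop: batch import of a relational table into HDFS
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/shop \
      --username etl \
      --table orders \
      --target-dir /data/orders

    # Flume: agent config tailing a growing log file into HDFS
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app/app.log
    a1.sources.r1.channels = c1
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /data/logs/%Y-%m-%d
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    a1.sinks.k1.channel = c1
    a1.channels.c1.type = memory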

Transferring files from remote node to HDFS with Flume

Submitted by 点点圈 on 2019-12-10 02:43:39
Question: I have a bunch of binary files compressed in *.gz format. They are generated on a remote node and must be transferred to HDFS on one of the datacenter's servers. I'm exploring the option of sending the files with Flume, specifically with a Spooling Directory source, but apparently that only works when the spooled directory is local to the agent writing to HDFS. Any suggestions on how to tackle this problem? Answer 1: There is no out-of-the-box solution for such a case.
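One commonly suggested pattern (my sketch, not the truncated answer) is a two-agent topology: a Flume agent on the remote node spools the files and forwards them over Avro, and a collector agent in the datacenter receives them and writes to HDFS. Hostnames, ports, and paths below are invented for illustration. Note also that the spooling directory source splits files into line-based events by default, which is unsuitable for binary *.gz files, so a whole-file (BLOB) deserializer or a prior decode step would be needed in practice.

    # Agent on the remote node: spool local files, ship them via Avro
    remote.sources = spool
    remote.channels = ch
    remote.sinks = avro
    remote.sources.spool.type = spooldir
    remote.sources.spool.spoolDir = /data/outgoing
    remote.sources.spool.channels = ch
    remote.sinks.avro.type = avro
    remote.sinks.avro.hostname = collector.dc.example.com
    remote.sinks.avro.port = 4141
    remote.sinks.avro.channel = ch
    remote.channels.ch.type = file

    # Collector agent in the datacenter: receive Avro events, write to HDFS
    collector.sources = av
    collector.channels = ch
    collector.sinks = k1
    collector.sources.av.type = avro
    collector.sources.av.bind = 0.0.0.0
    collector.sources.av.port = 4141
    collector.sources.av.channels = ch
    collector.sinks.k1.type = hdfs
    collector.sinks.k1.hdfs.path = /ingest/gz
    collector.sinks.k1.channel = ch
    collector.channels.ch.type = file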

Flume vs. Logstash: A Comparison

Submitted by 江枫思渺然 on 2019-12-10 00:37:04
Contents: 1. Overview; 2. A general data collection model; 3. Logstash; 4. Flume (1. Flume OG, 2. Flume NG); 5. Comparison. 1. Overview: While working on a Logstash use case, I started wondering why Flume couldn't be used in place of Logstash, so I went through quite a bit of material and summarize it here; most of it rests on the working experience of others, with some of my own thinking mixed in, and I hope it is useful. Data collection is a vital, foundational part of big data technology: data does not wander into your data platform by itself. You need something to collect it from the devices you already have (servers, routers, switches, firewalls, databases, and so on) and transport it into your platform; only then can the more complex, demanding processing begin. Today, Flume and Logstash are the mainstream data collection tools (used mainly for log collection), but many people still do not quite understand the difference between them. For users in particular, picking the right collection tool for a given scenario can greatly improve efficiency and reliability while lowering resource costs. We will look at Logstash first, then at Flume. 2. A general data collection model: (figure: data collection in a general-purpose environment) Of its stages, collection and storage are the essential ones; the rest are optional. Simple, isn't it? Programming is modular at heart, and this really is not that hard. But this is only a rough, general model; every open-source community or commercial vendor builds with its own considerations and aims. Flume and Logstash, the subjects of this article, both belong in principle to the data-collection category

How to install and configure apache flume?

Submitted by 陌路散爱 on 2019-12-09 09:46:19
Question: I am new to Apache Flume. I need to install Flume on top of an HDFS cluster environment. I googled it, and everyone suggests using the Cloudera distribution, but I need to install and configure it from the official distribution myself. Can anyone suggest where to start and how to customize the Flume agent and sink services? Answer 1: I have just installed Apache Flume 1.3 on Ubuntu. You need to download the binary zip for your OS, extract it, and create a config file, which is similar to a properties file, in
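A sketch of those steps for a plain Apache (non-CDH) install; the version, paths, and the minimal agent below (the netcat-to-logger example from the Flume user guide) are illustrative:

    # Download and unpack the Apache binary distribution
    wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
    tar -xzf apache-flume-1.9.0-bin.tar.gz
    cd apache-flume-1.9.0-bin

    # conf/example.conf: a minimal agent definition in Java-properties syntax
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1
    a1.sinks.k1.type = logger
    a1.sinks.k1.channel = c1
    a1.channels.c1.type = memory

    # Run the agent
    bin/flume-ng agent -c conf -f conf/example.conf -n a1 -Dflume.root.logger=INFO,console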

A Massive-Scale Structured Log Analysis System

Submitted by 爱⌒轻易说出口 on 2019-12-08 18:43:51
Background: Logs look different depending on your role and vantage point. Developers (RD) use logs first of all to debug programs; once a program is live, logs exist to record errors and traces. Product managers (PM) can derive business-metric statistics from log analysis. Broadly, logs serve three purposes: exceptions, tracing, and statistics. Pain points of using logs: most log usage today looks like this: 1. Logs are plain text, i.e. human-readable. 2. Logs are scattered across machines, or synced onto a single machine. 3. Someone spots a problem and tells someone else to go dig through the logs. That raises several problems, or rather opportunities for improvement: 1. Plain text is highly redundant and wastes resources; converting to binary is estimated to give roughly a fivefold saving. 2. Scattered logs make searching inefficient, and even centralized logs are slow to scan without a concrete point in time. 3. Log analysis is hard: whoever wrote a log has to be the one to search it, and a query hit is still one step short of spotting the answer with the naked eye... Goal: So we can design a log system that: 1. supports log storage at massive scale (terabytes of binary data); 2. keeps logs binary and structured; 3. is fast and easy to query. Design: reading and writing logs. Once log-query workflows are decomposed and their common traits summarized, nearly every query falls into one of three cases: 1. K-V lookup, e.g. fetching a single record by message ID. 2. Set query, e.g. fetching a given user's message records. 3. Set range query, e.g. fetching a given user's messages between 00:00 and 01:00. SSDB suits all three needs very well. A log system is write-heavy and read-light, and among mainstream disk-based storage designs LSM is the right fit; LevelDB, which SSDB's storage relies on, belongs precisely to the LSM family. Binary logs
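As an illustration of how those three query shapes could map onto SSDB/LevelDB's ordered key space (the key layout and names are my sketch, not the post's actual design):

    # K-V lookup: fetch one record directly by message ID
    msg:{message_id}                  -> one binary log record

    # Set / range queries: prefix keys with user ID and a sortable timestamp,
    # so records for one user sit adjacent in lexicographic key order
    user:{user_id}:{ts}:{message_id}  -> pointer to the record

    # "User 42's messages from 00:00 to 01:00" then becomes a key-range scan:
    scan user:42:20191208T0000 .. user:42:20191208T0100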

Implementing a Flume Sink

Submitted by 萝らか妹 on 2019-12-08 11:29:43
Question: So, I need to implement my own Flume sink. I went through this link, but the only part I'm missing is: what exactly do I do once I am done with my Java implementation? Compile it into a .class? A JAR? And how do I configure Flume to use my custom sink? Thanks. Answer 1: Compile it and package it into a jar, then put the jar in <apache flume install dir>/lib. You can then refer to your class by its fully qualified name in a sink definition. Source: https://stackoverflow.com/questions/21856585/implementing-a
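For concreteness, a minimal sink skeleton against the public Flume sink API; the package, class, and agent names are illustrative:

    package com.example;

    import org.apache.flume.Channel;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.Transaction;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.sink.AbstractSink;

    // Minimal custom sink: takes one event per call and prints its body.
    public class MySink extends AbstractSink implements Configurable {

        @Override
        public void configure(Context context) {
            // Read sink-level settings from the agent's properties file here.
        }

        @Override
        public Status process() throws EventDeliveryException {
            Channel channel = getChannel();
            Transaction txn = channel.getTransaction();
            txn.begin();
            try {
                Event event = channel.take();
                if (event == null) {
                    txn.commit();
                    return Status.BACKOFF; // channel is empty, back off
                }
                System.out.println(new String(event.getBody()));
                txn.commit();
                return Status.READY;
            } catch (Exception e) {
                txn.rollback();
                throw new EventDeliveryException(e);
            } finally {
                txn.close();
            }
        }
    }

After packaging this into a jar and dropping it in lib/, the sink definition refers to the class by its fully qualified name: a1.sinks.k1.type = com.example.MySink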

How Big Data Technologies (Hadoop, Hive, HBase, Spark, Flume, etc.) Relate to Each Other

Submitted by 妖精的绣舞 on 2019-12-08 09:58:32
Big data is composed of a whole series of technologies, so how do they fit together? See the diagram below: Hadoop mainly provides the file storage system plus a relatively weak MapReduce scheme for processing data. Hive is an upgrade built on top of MapReduce and the file storage system. Spark + HBase + Hadoop mainly address Hadoop's weakness at real-time data processing. Source: https://www.cnblogs.com/jueshixingkong/p/12004671.html