flume

Flume deployment

那年仲夏 submitted on 2019-12-17 04:26:21
Reference notes: https://www.cnblogs.com/yinzhengjie/p/11183988.html Official site: http://flume.apache.org/documentation.html User guide: https://github.com/apache/flume/blob/trunk/flume-ng-doc/sphinx/FlumeUserGuide.rst Source: https://www.cnblogs.com/hongfeng2019/p/11988507.html

Flume installation and deployment

独自空忆成欢 submitted on 2019-12-17 02:11:08
Flume introduction

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data. Flume can collect source data in many forms, such as files, socket packets, directories, and Kafka topics, and it can sink the collected data to many external storage systems, including HDFS, HBase, Hive, and Kafka.

Installation and deployment (a minimal configuration to verify the install is sketched after these steps)

1. Installing Flume is very simple: just unpack the archive. The only prerequisite is an existing Hadoop environment.
2. Upload the installation file and unpack it:
tar -zxvf flume-ng-1.6.0-cdh5.14.0.tar.gz -C /export/servers/
3. Enter the conf directory inside the unpacked directory:
cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
4. Make a copy of flume-env.sh:
cp flume-env.sh.template flume-env.sh
5. Edit flume-env.sh:
vim flume-env.sh
export JAVA_HOME=/export/servers/jdk1.8.0_141

Source: CSDN Author: Dreamy_zsy Link: https://blog.csdn.net/Dreamy_zsy/article/details/103570243
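To sanity-check an installation like the one above, a trivial agent can be run. The following is a minimal sketch (netcat source, memory channel, logger sink); the agent name a1, the file name conf/example.conf, and port 44444 are illustrative assumptions rather than values from the post above.

# conf/example.conf (illustrative): minimal agent used only to verify the installation
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# Start it from the Flume home directory with something like:
# bin/flume-ng agent --conf conf --conf-file conf/example.conf --name a1 -Dflume.root.logger=INFO,console
# then "telnet localhost 44444" and watch the events appear in the console log.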

Flume monitoring with Ganglia

≡放荡痞女 submitted on 2019-12-16 18:19:57
For plain log collection I think monitoring is of limited value, because writes are usually not particularly fast. But with a spooldir source receiving a dozen or more gigabytes per hour for Flume to parse, especially in combination with Kafka or other frameworks, monitoring becomes important: it lets you analyze where the bottleneck of the whole architecture is.

Flume's monitoring is JSON-based. Metrics are produced via JMX and the JSON data can be fetched directly over the web, but that is not very readable, so the data can also be handed to another monitoring framework for display. The official documentation briefly describes the Ganglia approach.

After installing and starting Ganglia (http://www.cnblogs.com/admln/p/ganglia-install-yum.html), no further Ganglia-side configuration is needed. To have Flume send metrics to Ganglia, one option is to configure it in flume-env.conf, so that every started agent reports to Ganglia; another is to specify it when starting an individual application:

$ bin/flume-ng agent --conf-file example.conf --name myname -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=myhost:8649

Personally I don't find the resulting display particularly good, but it is still better than reading raw JSON. There is also talk online of handing the data to Zabbix for display, but looking at Meituan's graphs (http://tech
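For reference, here is a sketch of the environment-file route mentioned above, so that every agent started from the installation reports metrics without per-command flags. The Ganglia host myhost:8649 is carried over from the example command, and passing the properties through JAVA_OPTS in flume-env.sh is an assumption about how the environment file is used, not something spelled out in the post:

# flume-env.sh (sketch): report metrics from every agent started here
export JAVA_OPTS="$JAVA_OPTS -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=myhost:8649"

# Or, to expose the raw JSON metrics over HTTP instead of Ganglia:
# export JAVA_OPTS="$JAVA_OPTS -Dflume.monitoring.type=http -Dflume.monitoring.port=34545"

The JSON form can then be fetched with a plain HTTP GET against that port and inspected directly.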

The most complete collection of big data technologies: the Hadoop family, the Cloudera series, and Spark

ぐ巨炮叔叔 submitted on 2019-12-16 06:55:20
When it comes to big data we all know Hadoop, yet all kinds of other technologies keep entering our field of view: Spark, Storm, Impala, more than we can keep up with. To architect big data projects better, this post organizes them so that engineers, project managers, and architects can pick suitable technologies, understand how the various big data technologies relate to one another, and choose an appropriate language.

We can read this article with the following questions in mind:
1. What technologies does Hadoop include?
2. What is the relationship between Cloudera and Hadoop, what products does Cloudera have, and what are their characteristics?
3. How does Spark relate to Hadoop?
4. How does Storm relate to Hadoop?

The Hadoop family

Founder: Doug Cutting

The Hadoop family consists of the following sub-projects:

Hadoop Common: the lowest-level module of the Hadoop stack, providing utilities for the other sub-projects, such as configuration file and logging facilities.

HDFS: the primary distributed storage system for Hadoop applications. An HDFS cluster contains a NameNode (master node), which manages all file system metadata, and DataNodes (data nodes, of which there can be many), which store the actual data. HDFS is designed for massive data: where traditional file systems are optimized for large numbers of small files, HDFS is optimized for accessing and storing a small number of very large files.

MapReduce: a software framework for easily writing parallel applications that process huge (terabyte-scale) data sets, connecting tens of thousands of nodes (commodity hardware) in a large cluster in a reliable, fault-tolerant way

Getting data directly from a website to a hdfs

放肆的年华 submitted on 2019-12-13 10:42:33
Question: How do I get data that is entering a website concurrently directly onto HDFS?

Answer 1: If you plan to have high-availability reads and writes, then you can use HBase to store the data. If you are using a REST API, you can store the data directly in HBase, as it has a dedicated HBase REST API that can write into HBase tables.

1) Linear and modular scalability.
2) Strictly consistent reads and writes.
3) Automatic and configurable sharding of tables.

For more about HBase: https://hbase.apache.org/
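Since this question appears here under the Flume tag, it is worth noting that Flume itself can also carry HTTP traffic straight to HDFS, which may be closer to what the question asks. A minimal sketch, with the agent name, port, and HDFS path chosen purely for illustration:

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# The HTTP source accepts JSON-encoded events POSTed by the web application
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5140

a1.channels.c1.type = file

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/web-events/%Y/%m/%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1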

channel lock error while configuring flume's multiple sources using FILE channels

五迷三道 submitted on 2019-12-13 08:07:07
Question: Configuring multiple sources for an agent throws a lock error when using FILE channels. Below is my config file.

a1.sources = r1 r2
a1.sinks = k1 k2
a1.channels = c1 c3

#sources
a1.sources.r1.type=netcat
a1.sources.r1.bind=localhost
a1.sources.r1.port=4444
a1.sources.r2.type=exec
a1.sources.r2.command=tail -f /opt/gen_logs/logs/access.log

#sinks
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=/flume201
a1.sinks.k1.hdfs.filePrefix=netcat-
a1.sinks.k1.rollInterval=100
a1.sinks.k1.hdfs.fileType
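The configuration is cut off above, but with two FILE channels the usual cause of a lock error is that both channels fall back to the same default checkpoint and data directories under ~/.flume. A sketch of the kind of fix that is normally applied, with illustrative directory paths:

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/c1/checkpoint
a1.channels.c1.dataDirs = /var/flume/c1/data

a1.channels.c3.type = file
a1.channels.c3.checkpointDir = /var/flume/c3/checkpoint
a1.channels.c3.dataDirs = /var/flume/c3/data

Each FILE channel needs its own checkpointDir and dataDirs, because a file channel takes an exclusive lock on its directories.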

Flume: Data transferring to Server

别来无恙 submitted on 2019-12-13 04:52:40
Question: I am new to Flume-ng. I have to write a program which can transfer a text file to another program (agent). I know we must know about the agent, i.e. host IP, port number, etc. Then a source, sink and a channel should be defined. I just want to transfer a log file to a server. My client code is as follows.

public class MyRpcClientFacade {
    public class MyClient {
        private RpcClient client;
        private String hostname;
        private int port;

        public void init(String hostname, int port) {
            this.hostname = hostname;
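The snippet above is cut off, but a Flume RpcClient of this kind would typically talk to an Avro source on the agent side. A minimal sketch of that receiving agent, with the agent name a1 and port 41414 as illustrative assumptions:

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro source: this is what the client's hostname/port should point at
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414

a1.channels.c1.type = memory

# A logger sink just proves events arrive; swap in another sink as needed
a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1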

HDFS sink: “clever” folder routing

时间秒杀一切 submitted on 2019-12-12 10:21:37
Question: I am new to Flume (and to HDFS), so I hope my question is not stupid. I have a multi-tenant application (about 100 different customers for now) and 16 different data types. (In production, we have approx. 15 million messages/day through our RabbitMQ.) I want to write all my events to HDFS, separated by tenant, data type, and date, like this:

/data/{tenant}/{data_type}/2014/10/15/file-08.csv

Is it possible with one sink definition? I don't want to duplicate configuration, and new
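The question is cut off above, but this kind of routing is normally possible with a single HDFS sink, because hdfs.path can interpolate event headers and timestamp escape sequences. A sketch, assuming the events carry tenant and data_type headers (set by the source or an interceptor; the property values here are illustrative):

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/%{tenant}/%{data_type}/%Y/%m/%d
a1.sinks.k1.hdfs.filePrefix = file
a1.sinks.k1.hdfs.fileType = DataStream
# The %Y/%m/%d escapes need a timestamp header; add a timestamp
# interceptor on the source or fall back to the sink host's clock:
a1.sinks.k1.hdfs.useLocalTimeStamp = true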

Apache Flume - send only new file contents

家住魔仙堡 submitted on 2019-12-12 06:09:07
Question: I am a very new user of Flume, so please treat me as an absolute noob. I am having a minor issue configuring Flume for a particular use case and was hoping you could assist. Note that I am not using HDFS, which is why this question is different from others you may have seen on forums. I have two virtual machines (VMs) connected to each other through an internal network in Oracle VirtualBox. My goal is to have one VM watch a particular directory that will only ever have one file in it. When the
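The post is cut off above, but since HDFS is not involved, one common shape for this setup (sketched here with illustrative hosts, ports, and paths, and assuming that "new file contents" means lines appended to that single file) is an exec source tailing the file on the watching VM, an Avro hop to the second VM, and a file_roll sink there:

# VM1 (watcher): tail the file and ship events to VM2
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /path/to/watched/file.log
a1.channels.c1.type = file
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.56.102
a1.sinks.k1.port = 4545
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# VM2 (receiver): accept the Avro stream and roll it into local files
a2.sources = r1
a2.channels = c1
a2.sinks = k1
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.channels.c1.type = file
a2.sinks.k1.type = file_roll
a2.sinks.k1.sink.directory = /var/flume/out
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1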

Issues with Flume HDFS sink from Twitter

断了今生、忘了曾经 submitted on 2019-12-12 04:45:54
Question: I currently have this configuration in Flume:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses
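The configuration quoted above is cut off right after the license header, so the actual sources and sinks are not visible. For reference, the usual shape of a Twitter-to-HDFS agent looks roughly like the sketch below; every value (agent name, credential placeholders, paths, sizes) is illustrative and not taken from the post:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = YOUR_CONSUMER_KEY
TwitterAgent.sources.Twitter.consumerSecret = YOUR_CONSUMER_SECRET
TwitterAgent.sources.Twitter.accessToken = YOUR_ACCESS_TOKEN
TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode:8020/user/flume/tweets/%Y/%m/%d
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel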