flume

Transferring files from remote node to HDFS with Flume

你离开我真会死。 Submitted on 2019-12-05 02:03:19
I have a bunch of binary files compressed in *.gz format. They are generated on a remote node and must be transferred to HDFS on one of the datacenter's servers. I'm exploring the option of sending the files with Flume, specifically with a Spooling Directory configuration, but apparently that only works when the spooled directory is local to the same HDFS node. Any suggestions on how to tackle this problem? arghtype: There is no out-of-box solution for such a case, but you could try these workarounds: You could create your own source implementation for such
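There is a common two-tier pattern for exactly this situation that is worth sketching here (it is not from the truncated answer above, and the host name, port and paths are placeholders): run one Flume agent on the remote node that spools the local directory and forwards events over an Avro sink, and a second agent on a machine with HDFS access that receives them on an Avro source and writes them with an HDFS sink.

# Agent on the remote node: spool local files and forward them over Avro
remote.sources = spool
remote.channels = ch
remote.sinks = fwd

remote.sources.spool.type = spooldir
remote.sources.spool.spoolDir = /data/incoming
remote.sources.spool.channels = ch

remote.channels.ch.type = file

remote.sinks.fwd.type = avro
remote.sinks.fwd.hostname = collector.example.com
remote.sinks.fwd.port = 4141
remote.sinks.fwd.channel = ch

# Agent on the Hadoop edge node: receive Avro events and write them to HDFS
collector.sources = in
collector.channels = ch
collector.sinks = hdfs

collector.sources.in.type = avro
collector.sources.in.bind = 0.0.0.0
collector.sources.in.port = 4141
collector.sources.in.channels = ch

collector.channels.ch.type = file

collector.sinks.hdfs.type = hdfs
collector.sinks.hdfs.hdfs.path = hdfs://namenode:8020/ingest/gz
collector.sinks.hdfs.hdfs.fileType = DataStream
collector.sinks.hdfs.channel = ch

Note that the default spooling-directory deserializer splits files into line events, which is not what you want for binary *.gz files; see the whole-file-as-one-event question further down this page.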

Big Data Development Reference Materials

江枫思渺然 Submitted on 2019-12-05 01:48:17
Reference link: https://www.cnblogs.com/Thomas-blog/p/9728179.html
Related PDFs: https://pan.baidu.com/s/1X_e4koNHs43tdUsF0Kd0Bg (extraction code: 7a3l)

1. Skill map for big data development engineers

11 must-have skills:
Advanced Java (JVM internals, concurrency)
Basic Linux operations
Hadoop (HDFS + MapReduce + YARN)
HBase (Java API usage + Phoenix)
Hive (basic HQL and an understanding of how it works)
Kafka
Storm / JStorm
Scala
Python
Spark (Core + Spark SQL + Spark Streaming)
Auxiliary tools (Sqoop / Flume / Oozie / Hue, etc.)

6 advanced skills:
Machine learning algorithms, plus the Mahout library and MLlib
R
Lambda architecture
Kappa architecture
Kylin
Alluxio

2. Reference materials
1) Advanced Java (《深入理解Java虚拟机》, 《Java高并发实战》), about 30 hours
2) ZooKeeper (this blog post is a good starting point: http://www.cnblogs.com/wuxl360/p/5817471.html ), an introduction to the ZooKeeper distributed coordination service.

unable to download data from twitter through flume

混江龙づ霸主 Submitted on 2019-12-04 22:43:25
bin/flume-ng agent -n TwitterAgent --conf ./conf/ -f conf/flume-twitter.conf -Dflume.root.logger=DEBUG,console

When I run the above command it generates the following error:

2016-05-06 13:33:31,357 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] 404:The URI requested is invalid or the resource requested, such as a user, does not exist. Unknown URL. See Twitter Streaming API documentation at http://dev.twitter.com/pages/streaming_api

This is my flume-twitter.conf file, located in the flume/conf folder: TwitterAgent
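The excerpt cuts off before the configuration itself, so as context only, here is a minimal sketch of an agent built around the Twitter source that ships with Flume; the OAuth values are placeholders and the HDFS path is an assumption. The 404 in the log above complains about the Streaming API URL, which in many reports traces back to the twitter4j version bundled with the Flume build rather than to the agent configuration.

# Minimal sketch, not the poster's actual flume-twitter.conf
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your-consumer-key>
TwitterAgent.sources.Twitter.consumerSecret = <your-consumer-secret>
TwitterAgent.sources.Twitter.accessToken = <your-access-token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your-access-token-secret>

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream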

Flume Example 1: Collecting Tomcat Logs

好久不见. Submitted on 2019-12-04 17:32:11
Scenario: collect Tomcat's logs into a specified directory. Tomcat is installed under /opt/tomcat, and the collected logs are written to /var/log/data. Configure tomcat.conf as follows:

# A single-node Flume configuration

# Name the components on this agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -n +0 -F /opt/tomcat/logs/catalina.out
agent1.sources.source1.channels = channel1

# Describe sink1
agent1.sinks.sink1.type = file_roll
agent1.sinks.sink1.sink.directory = /var/log/data

# Use a channel which buffers events (a file channel here)
agent1.channels.channel1.type = file
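A sketch of how the agent could then be started, assuming the configuration above is saved as /opt/flume/conf/tomcat.conf (that path is an assumption; the agent name agent1 comes from the config):

bin/flume-ng agent -n agent1 --conf conf -f /opt/flume/conf/tomcat.conf -Dflume.root.logger=INFO,console

Note that the excerpt ends at the channel type; a complete configuration would also bind the sink to the channel (agent1.sinks.sink1.channel = channel1) and, for a file channel, usually set its checkpoint and data directories.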

Flume + Kafka Test (Monitoring a File)

孤街浪徒 Submitted on 2019-12-04 17:20:44
1. Integrating Flume with Kafka

1) Download the plugin package. The Flume/Kafka plugin can be downloaded from: https://github.com/beyondj2ee/flumeng-kafka-plugin
2) Copy the jars. Copy the jars from the plugin package into flume/lib (remove duplicate jars that differ only in version; in particular, delete scala-compiler-2.9.2.jar, otherwise Flume will fail to start). Also copy the jars from kafka/libs into flume/lib.

2. Configure the Flume configuration file (monitoring a file):

vi /opt/flume/conf/hw.conf

agent.sources = s1
agent.channels = c1
agent.sinks = k1

agent.sources.s1.type = exec
agent.sources.s1.command = tail -F /opt/log/debug.log
agent.sources.s1.channels = c1

agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000
agent.channels.c1.transactionCapacity = 100

# Set up the Kafka sink
agent.sinks.k1.type = org.apache.flume.sink.kafka
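The excerpt cuts off in the middle of the sink type. For comparison, Flume 1.7 and later ship their own Kafka sink, so the third-party plugin is not needed on those versions; a sketch of that built-in sink, with the broker address and topic name as placeholders:

# Built-in Kafka sink (Flume 1.7+); broker and topic are placeholders
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.kafka.bootstrap.servers = broker1:9092
agent.sinks.k1.kafka.topic = flume-test
agent.sinks.k1.flumeBatchSize = 100
agent.sinks.k1.kafka.producer.acks = 1
agent.sinks.k1.channel = c1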

Flume - Can an entire file be considered an event in Flume?

安稳与你 Submitted on 2019-12-04 13:57:58
Question: I have a use case where I need to ingest files from a directory into HDFS. As a POC, I used simple directory spooling in Flume, where I specified the source, sink and channel, and it works fine. The disadvantage is that I would have to maintain multiple directories for multiple file types that go into distinct folders in order to get greater control over file sizes and other parameters, which makes the configuration repetitive, though easy. As an alternative, I was advised to use regex interceptors
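On the title question itself: the spooling-directory source reads one event per line by default, but its deserializer is pluggable. A sketch of a whole-file-as-one-event setup, assuming the BlobDeserializer from Flume's morphline/Solr module is available on the agent's classpath; names and paths are illustrative:

# Spooling-directory source that emits each spooled file as a single event
a1.sources.spool.type = spooldir
a1.sources.spool.spoolDir = /data/incoming
a1.sources.spool.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# Upper bound on bytes buffered per event
a1.sources.spool.deserializer.maxBlobLength = 100000000
a1.sources.spool.channels = c1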

Can apache flume hdfs sink accept dynamic path to write?

情到浓时终转凉″ Submitted on 2019-12-04 12:35:21
Question: I am new to Apache Flume. I am trying to see how I can receive a JSON (via an HTTP source), parse it, and store it to a dynamic path on HDFS according to its content. For example, if the JSON is: [{ "field1" : "value1", "field2" : "value2" }] then the HDFS path should be: /some-default-root-path/value1/value2/some-value-name-file Is there a Flume configuration that lets me do that? Here is my current configuration (it accepts a JSON via HTTP and stores it in a path according to timestamp):
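The usual approach (not shown in the truncated excerpt) is to copy the values into event headers with an interceptor and then reference those headers in the HDFS sink path. A sketch, assuming the event body contains the JSON shown above and using the built-in regex_extractor interceptor; the source/sink names and the regex are illustrative, and the regex would need hardening for real payloads:

# Pull field1/field2 values out of the JSON body into event headers
a1.sources.http.interceptors = i1
a1.sources.http.interceptors.i1.type = regex_extractor
a1.sources.http.interceptors.i1.regex = "field1"\\s*:\\s*"([^"]+)".*"field2"\\s*:\\s*"([^"]+)"
a1.sources.http.interceptors.i1.serializers = s1 s2
a1.sources.http.interceptors.i1.serializers.s1.name = field1
a1.sources.http.interceptors.i1.serializers.s2.name = field2

# Build the HDFS path from those headers
a1.sinks.hdfs.type = hdfs
a1.sinks.hdfs.hdfs.path = /some-default-root-path/%{field1}/%{field2}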

Using an HDFS Sink and rollInterval in Flume-ng to batch up 90 seconds of log information

生来就可爱ヽ(ⅴ<●) Submitted on 2019-12-04 09:43:09
Question: I am trying to use Flume-ng to grab 90 seconds of log information and put it into a file in HDFS. I have Flume watching the log file via an exec source and tail, but it is creating a file every 5 seconds instead of the 90 seconds I am trying to configure. My flume.conf is as follows:

# example.conf: A single-node Flume configuration

# Name the components on this agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure source1
agent1
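The excerpt ends before the sink definition, but the usual cause of this behaviour is that the HDFS sink rolls files on size and event count as well as on time, and its defaults for hdfs.rollSize and hdfs.rollCount trigger first. A sketch of sink settings that roll purely on a 90-second interval (the sink and channel names follow the config above; the HDFS path is a placeholder):

# Roll only on time: every 90 seconds, never on size or event count
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = channel1
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/logs
agent1.sinks.sink1.hdfs.rollInterval = 90
agent1.sinks.sink1.hdfs.rollSize = 0
agent1.sinks.sink1.hdfs.rollCount = 0
agent1.sinks.sink1.hdfs.fileType = DataStream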

好程序员 Big Data Tutorial: A Summary of Hadoop HDFS Commands

点点圈 Submitted on 2019-12-04 06:51:11
1. List all directories and files under the root directory:
hadoop fs -ls /
2. List all directories and files under /logs:
hadoop fs -ls /logs
3. Recursively list all files under /user and its subdirectories (use with caution):
hadoop fs -ls -R /user
4. Create the /soft directory:
hadoop fs -mkdir /soft
5. Create nested directories:
hadoop fs -mkdir -p /apps/windows/2017/01/01
6. Upload the local wordcount.jar file to the /wordcount directory:
hadoop fs -put wordcount.jar /wordcount
7. Copy the /stu/students.txt file to the local filesystem:
hadoop fs -copyToLocal /stu/students.txt
8. Copy the word.txt file to the /wordcount/input/ directory:
hadoop fs -copyFromLocal word.txt /wordcount/input
9. Move the word.txt file from the local filesystem to the /wordcount/input/ directory:
hadoop fs -moveFromLocal word.txt /wordcount/input/
10. /stu/students

Hadoop: The Big Data Tool You Have to Know About

送分小仙女□ Submitted on 2019-12-04 03:03:53
Apache Hadoop has become the driving force behind the big data industry. Technologies such as Hive and Pig also come up frequently, but what do they actually do, and why do they have such odd names (Oozie, ZooKeeper, Flume)?

Hadoop brought the ability to process big data cheaply (where big data typically means data volumes of 10-100 GB or more, of many kinds, structured and unstructured alike). But how is this different from what came before?

Today's enterprise data warehouses and relational databases are good at handling structured data and can store very large amounts of it, but at considerable cost. Their requirements on the data limit the kinds of data they can process, and that inertia also hurts a data warehouse's agility when exploring massive amounts of heterogeneous data. This often means that valuable data sources inside an organization are never mined. This is the biggest difference between Hadoop and traditional data processing.

This article focuses on the components that make up the Hadoop system and explains what each of them does.

MapReduce: the core of Hadoop (recommended reading: "How I explained MapReduce to my wife")

While Google's web search engine benefited from its algorithms, MapReduce was doing a great deal of work behind the scenes. The MapReduce framework has become the most influential "engine" behind today's big data processing. Beyond Hadoop, you will also find MapReduce in MPP systems (Sybase IQ introduced a columnar database) and NoSQL stores (such as Vertica and MongoDB).