flume

(2) Data Collection: Flume

Submitted by 爱⌒轻易说出口 on 2019-12-23 03:25:01
Table of contents: 1. Flume Overview (Introduction; Data Sources); 2. Flume Architecture (Architecture Diagram; Components and Their Functions; Flume Execution Flow; Core Components: Source, Channel, Sink); 3. Flume Installation (Runtime Environment; Installation Steps); 4. Getting Started with Flume (Configuration File; Starting Flume); 5. Integrating Flume with log4j (Dependencies; Configuring the Log File; Configuring the Flume Configuration File; Starting and Running; Checking the Results); 6. Multi-Level Data Collection (Multi-Level Chaining; Multi-Level Data Collection Structure)

1. Flume Overview
Introduction: Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data. It can be used to collect, consolidate, and move large amounts of log data from systems of different origins into a designated data store.
Data sources: Flume's collection sources include console, avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and others.

2. Flume Architecture
Components and their functions:
Source: collects data from the Client and passes it to the Channel; different Sources can accept different data formats.
Channel: a storage pool connecting the sources and sinks
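
The "Getting Started with Flume" and log4j-integration sections in the outline above revolve around a single agent described in one properties file. A minimal sketch, assuming an agent named a1 with a netcat source, a memory channel, and a logger sink (these component names, the port, and every other value are illustrative assumptions, not taken from the article):

```properties
# Minimal single-agent pipeline: netcat source -> memory channel -> logger sink
# Component names (a1, r1, c1, k1) and the port are illustrative assumptions.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: accept newline-terminated text over TCP
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: write events to the agent's log, handy for a first smoke test
a1.sinks.k1.type = logger

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such an agent is typically started with something like `flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console`; the article's own installation and startup steps are not visible in this excerpt.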

Error creating a Hive table to load Twitter data

Submitted by 萝らか妹 on 2019-12-23 02:40:44
Question: I am trying to create an external table and load Twitter data into it. While creating the table I get the following error and have not been able to track down its cause.

hive> ADD JAR /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar
    > ;
Added [/usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar] to class path
Added resources: [/usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar]
hive> CREATE EXTERNAL TABLE tweets (
    >   id BIGINT,
    >   created_at STRING,
    >   source STRING,
    >   favorited BOOLEAN,
    >

Flume Overview

Submitted by 跟風遠走 on 2019-12-23 00:56:49
Introduction
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple, flexible architecture based on streaming data flows and uses many failover and recovery mechanisms to guarantee reliability.

Architecture
Source: the Source is the component responsible for receiving data into the Flume Agent. The Source component can handle log data of various types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.
Channel: the Channel is a buffer that sits between the Source and the Sink, which allows the Source and the Sink to operate at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time. Flume has two kinds of Channel: the Memory Channel and the File Channel. The Memory Channel buffers data in memory; if the process crashes or restarts, the cached data is lost. The File Channel writes all data to disk, so no data is lost if the process restarts or the machine goes down.
Sink: the Sink continuously polls the Channel for events, removes them in batches, and writes those events in batches to the destination storage or sends them to another Flume Agent. Sink destinations include hdfs, logger, avro, thrift, ipc
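
Because the article contrasts the Memory Channel with the File Channel, a short hedged sketch of how the two are declared may be useful; the agent name, channel names, and directories below are assumptions for illustration only:

```properties
# Memory channel: fast, but buffered events are lost if the agent dies or restarts
a1.channels.m1.type = memory
a1.channels.m1.capacity = 10000
a1.channels.m1.transactionCapacity = 1000

# File channel: events are checkpointed and written to disk, so they survive restarts
a1.channels.f1.type = file
a1.channels.f1.checkpointDir = /var/flume/checkpoint
a1.channels.f1.dataDirs = /var/flume/data
```

The trade-off matches the text above: the memory channel favors throughput, while the file channel favors durability.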

Flume configuration to upload files with the same name

Submitted by Deadly on 2019-12-22 13:03:41
Question: I have 10 files with data of varying length. I would like to store the corresponding data in the same file and with the same filename, but Flume is splitting the data up and saving it as FlumeData.<timestamp>. I am using the configuration below:

a1.sources = r1
a1.sinks = k2
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
a1.channels.c1.trackerDir = /mnt/flume/track
a1.channels.c1.transactionCapacity = 10000000
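
One approach often suggested for this kind of requirement (a sketch, not necessarily what the asker ultimately used) is to have a spooling directory source record each input file's basename in an event header and then reuse that header in the HDFS sink's file prefix; every name and path below is an assumption:

```properties
# Spooling directory source: store the original file name in the "basename" header
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/flume/spool
a1.sources.r1.basenameHeader = true
a1.sources.r1.basenameHeaderKey = basename

# HDFS sink: use the header so output files carry the original file name
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.k2.hdfs.filePrefix = %{basename}
a1.sinks.k2.hdfs.fileType = DataStream
```

Even then, the sink still rolls output according to hdfs.rollInterval, hdfs.rollSize, and hdfs.rollCount, so a strict one-to-one mapping between input and output files is not guaranteed without tuning those as well.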

Reading Flume spoolDir in parallel

Submitted by 旧街凉风 on 2019-12-22 10:51:08
Question: Since I'm not allowed to set up Flume on the prod servers, I have to download the logs, put them in a Flume spoolDir, and have a sink consume from the channel and write to Cassandra. Everything is working fine. However, as I have a lot of log files in the spoolDir, and the current setup only processes one file at a time, it's taking a while. I want to be able to process many files concurrently. One way I thought of is to use the spoolDir but distribute the files into 5-10 different
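
The direction hinted at in the truncated text, splitting the input across several spool directories, would normally be expressed as multiple spooling directory sources feeding the same channel so that files are ingested in parallel. A sketch under that assumption (directory paths and component names are invented for illustration):

```properties
# Two spooling directory sources reading separate directories in parallel,
# both feeding one channel; the existing Cassandra sink drains the combined stream.
a1.sources = src1 src2
a1.channels = c1

a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /data/spool/part1
a1.sources.src1.channels = c1

a1.sources.src2.type = spooldir
a1.sources.src2.spoolDir = /data/spool/part2
a1.sources.src2.channels = c1
```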

Zookeeper keeps getting the WARN: “caught end of stream exception”

Submitted by 我的未来我决定 on 2019-12-21 05:07:20
Question: I am now using a CDH-5.3.1 cluster with three ZooKeeper instances located on three IPs:

133.0.127.40 n1
133.0.127.42 n2
133.0.127.44 n3

Everything works fine when it starts, but these days I have noticed that node n2 keeps getting the WARN "caught end of stream exception":

EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket
  at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
  at org.apache.zookeeper.server

Real-time log processing using Apache Spark Streaming

Submitted by 淺唱寂寞╮ on 2019-12-20 12:38:54
Question: I want to create a system where I can read logs in real time and use Apache Spark to process them. I am confused about whether I should use something like Kafka or Flume to pass the logs to Spark Streaming, or whether I should pass the logs using sockets. I have gone through a sample program in the Spark Streaming documentation (the Spark stream example), but I would be grateful if someone could guide me toward a better way to pass logs to Spark Streaming. It's kind of new turf for me. Answer 1: Apache Flume may help to read the logs in
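
For the Flume route the answer starts to describe, the push-based integration covered in the Spark Streaming documentation has Flume deliver events to an Avro sink pointed at the host and port where the Spark receiver listens. A sketch of the Flume side only, with the hostname and port as placeholders:

```properties
# Avro sink pushing events to a Spark Streaming Flume receiver (push-based approach)
# The hostname and port are placeholders for wherever the receiver runs.
a1.sinks = spark
a1.sinks.spark.type = avro
a1.sinks.spark.hostname = spark-receiver-host
a1.sinks.spark.port = 9988
a1.sinks.spark.channel = c1
```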

Flume HDFS Sink generates lots of tiny files on HDFS

Submitted by 对着背影说爱祢 on 2019-12-20 03:53:41
Question: I have a toy setup sending log4j messages to HDFS using Flume. I'm not able to configure the HDFS sink to avoid many small files. I thought I could configure the HDFS sink to create a new file every time the file size reaches 10 MB, but it is still creating files of around 1.5 KB. Here is my current Flume config:

a1.sources=o1
a1.sinks=i1
a1.channels=c1
#source configuration
a1.sources.o1.type=avro
a1.sources.o1.bind=0.0.0.0
a1.sources.o1.port=41414
#sink config
a1.sinks.i1.type=hdfs
a1.sinks.i1
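
A frequent cause of this symptom is that the HDFS sink rolls a new file on whichever of its three triggers fires first, and hdfs.rollCount defaults to only 10 events. A sketch of roll settings that aim for roughly 10 MB files, with the count and interval triggers disabled (values are assumptions, not the asker's full configuration):

```properties
# Roll on size only (~10 MB); rollCount and rollInterval are disabled,
# otherwise their defaults (10 events / 30 seconds) keep producing tiny files.
a1.sinks.i1.type = hdfs
a1.sinks.i1.hdfs.path = hdfs://namenode:8020/flume/logs
a1.sinks.i1.hdfs.rollSize = 10485760
a1.sinks.i1.hdfs.rollCount = 0
a1.sinks.i1.hdfs.rollInterval = 0
a1.sinks.i1.hdfs.fileType = DataStream
```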

Is it possible to write Flume headers to HDFS sink and drop the body?

Submitted by 試著忘記壹切 on 2019-12-20 03:41:10
Question: The text_with_headers serializer (an HDFS sink serializer) makes it possible to save the Flume event headers rather than discard them; the output format consists of the headers, followed by a space, then the body payload. We would like to drop the body and retain only the headers. For the HBase sink, the RegexHbaseEventSerializer allows us to transform the events, but I am unable to find such a provision for the HDFS sink. Answer 1: You can set the serializer property to header_and_text, which outputs both
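
To illustrate the suggestion in the answer, the serializer is set directly on the HDFS sink, which then writes the headers followed by the body; as far as I know there is no built-in headers-only serializer, so dropping the body entirely would require a custom EventSerializer. The sink name and path below are assumptions:

```properties
# Built-in serializer that writes event headers followed by the body
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/with-headers
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.serializer = header_and_text
```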

Rebalancing issue while reading messages in Kafka

Submitted by 你离开我真会死。 on 2019-12-18 12:16:32
Question: I am trying to read messages from a Kafka topic, but I am unable to read any. The process gets killed after some time without reading any messages. Here is the rebalancing error I get:

[2014-03-21 10:10:53,215] ERROR Error processing message, stopping consumer: (kafka.consumer.ConsoleConsumer$)
kafka.common.ConsumerRebalanceFailedException: topic-1395414642817-47bb4df2 can't rebalance after 4 retries
  at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance