flume

Error sinking JSON data into Apache Phoenix with Apache Flume

Submitted by 余生长醉 on 2020-01-05 04:11:08
Question: I want to sink JSON data into Apache Phoenix with Apache Flume. I followed an online guide, http://kalyanbigdatatraining.blogspot.com/2016/10/how-to-stream-json-data-into-phoenix.html, but ran into the following error. How can I resolve it? Many thanks! My environment: hadoop-2.7.3, hbase-1.3.1, phoenix-4.12.0-HBase-1.3-bin, flume-1.7.0. In Flume, I added the Phoenix-sink-related jars under $FLUME_HOME/plugins.d/phoenix-sink/lib: commons-io-2.4.jar twill-api-0.8.0.jar twill-discovery-api-0.8.0.jar json-path
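For reference, Flume's plugins.d mechanism expects a fixed per-plugin directory layout (the phoenix-sink directory name is simply the user's choice here); whether a given jar belongs in lib or libext depends on how the Phoenix sink and its dependencies are packaged:

$FLUME_HOME/plugins.d/phoenix-sink/lib/      # the plugin's own jar(s)
$FLUME_HOME/plugins.d/phoenix-sink/libext/   # the plugin's dependency jars (commons-io, twill, json-path, ...)
$FLUME_HOME/plugins.d/phoenix-sink/native/   # any required native libraries (.so files)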

How do I transform events in Flume and send them to another channel?

Submitted by 情到浓时终转凉″ on 2020-01-05 02:50:09
Question: Flume has some ready-made components for transforming events before pushing them further, like the RegexHbaseEventSerializer you can plug into an HBaseSink. It is also easy to provide a custom serializer. I want to process events and send them on to the next channel. The closest thing to what I want is the Regex Extractor Interceptor, which accepts a custom serializer for regexp matches. But it does not replace the event body; it only appends new headers with the results to events, which makes the output flow heavier. I'd like
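For context, the Regex Extractor Interceptor mentioned above is configured roughly as in the sketch below (the source name, regex, and header name are illustrative); it only adds headers and leaves the event body untouched, which is exactly the limitation being described:

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
# pull a numeric id out of the body and expose it as a header named "jobId";
# the body itself passes through unchanged
a1.sources.r1.interceptors.i1.regex = job=(\\d+)
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = jobId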

Can the spool dir of Flume be on a remote machine?

Submitted by 烈酒焚心 on 2020-01-04 02:45:06
Question: I was trying to fetch files from a remote machine into my HDFS whenever a new file arrives in a particular folder. I came across the spooling directory (spool dir) concept in Flume, and it works fine when the spool dir is on the same machine where the Flume agent runs. Is there any way to configure a spool dir on a remote machine? Please help. Answer 1: You might be aware that Flume can run as multiple instances, i.e. you can install several Flume agents that pass data between them. So to
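Assuming the answer continues along those lines, the usual pattern is a small agent on the remote machine (spooldir source, Avro sink) feeding a second agent next to the Hadoop cluster (Avro source, HDFS sink). A minimal sketch follows; hostnames, ports, and paths are placeholders:

# Agent "remote", running on the machine that owns the spool directory
remote.sources = r1
remote.channels = c1
remote.sinks = k1
remote.sources.r1.type = spooldir
remote.sources.r1.spoolDir = /data/incoming
remote.sources.r1.channels = c1
remote.channels.c1.type = memory
remote.sinks.k1.type = avro
remote.sinks.k1.hostname = collector.example.com
remote.sinks.k1.port = 4545
remote.sinks.k1.channel = c1

# Agent "collector", running on (or near) the Hadoop cluster
collector.sources = r1
collector.channels = c1
collector.sinks = k1
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1
collector.channels.c1.type = memory
collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/spool
collector.sinks.k1.channel = c1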

Error: Could not find or load main class org.apache.flume.node.Application - Install flume on hadoop version 1.2.1

Submitted by 守給你的承諾、 on 2020-01-03 05:59:07
Question: I have built a Hadoop cluster with one master node and one slave node. Now I want to set up Flume on the master machine to collect all of the cluster's logs. However, when I install Flume from the tarball I always get: Error: Could not find or load main class org.apache.flume.node.Application Please help me find the answer, or the best way to install Flume on my cluster. Many thanks! Answer 1: It is basically because of FLUME_HOME. Try this command: $ unset FLUME_HOME Answer 2: I know

How to handle multiline log entries in Flume

Submitted by 笑着哭i on 2020-01-03 02:49:06
Question: I have just started playing with Flume. I have a question about how to handle multiline log entries, such as stack traces during error conditions, as a single event. For example, treat the following as a single event rather than one event per line: 2013-04-05 05:00:41,280 ERROR (ClientRequestPool-PooledExecutionEngine-Id#4 ) [com.ms.fw.rexs.gwy.api.service.AbstractAutosysJob] job failed for 228794 java.lang.NullPointerException at com.ms.fw.rexs.core.impl.service.job

Compressed file ingestion using Flume

Submitted by 你说的曾经没有我的故事 on 2020-01-03 02:28:07
Question: Can I ingest any type of compressed file (say zip, bzip, lz4, etc.) into HDFS using Flume NG 1.3.0? I am planning to use spoolDir. Any suggestions, please. Answer 1: You can ingest any type of file. You need to select an appropriate deserializer. The route below works for compressed files; you can choose the options as you need:
agent.sources = src-1
agent.channels = c1
agent.sinks = k1
agent.sources.src-1.type = spooldir
agent.sources.src-1.channels = c1
agent.sources.src-1.spoolDir = /tmp/myspooldir
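The quoted configuration is cut off above. One plausible continuation, if the goal is to move each compressed file to HDFS unchanged, is to emit every spooled file as a single binary event using the BlobDeserializer that ships with Flume's morphline Solr sink module (those jars must be on the classpath, and the class is only bundled in newer Flume releases); sizes and paths below are placeholders:

# emit each spooled file as one event (binary blob) instead of line by line
agent.sources.src-1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# maximum size of a single file/event, in bytes
agent.sources.src-1.deserializer.maxBlobLength = 100000000

agent.channels.c1.type = memory
agent.channels.c1.capacity = 100

agent.sinks.k1.type = hdfs
agent.sinks.k1.channel = c1
agent.sinks.k1.hdfs.path = /flume/compressed
# DataStream writes the event bytes through as-is rather than wrapping them in a SequenceFile
agent.sinks.k1.hdfs.fileType = DataStream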

What is Flume, what is it used for, and what are its three components?

Submitted by 落爺英雄遲暮 on 2020-01-02 11:23:57
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive amounts of log data. It supports plugging in all kinds of data senders within a logging system to collect data, and it also provides the ability to do simple processing on the data and write it out to a variety of data receivers (such as text files, HDFS, HBase, and so on).

Flume's data flow is carried from end to end by events (Event). The Event is Flume's basic unit of data: it carries the log data (as a byte array) along with header information. Events are generated by the Source from data outside the Agent; when the Source captures an event it applies format-specific processing and then pushes the event into one or more Channels. You can think of a Channel as a buffer that holds an event until a Sink has finished processing it. The Sink is responsible for persisting the log or pushing the event on to another Source.

The core of a running Flume deployment is the Agent. The Agent is Flume's smallest independently running unit; one agent is one JVM. It is a complete data-collection tool containing three core components: the Source, the Channel, and the Sink.

From: CSDN  Author: DimplesDimples.  Link: https://blog.csdn.net/Betty_betty_betty/article/details/103799756
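To make the Source → Channel → Sink wiring concrete, here is a minimal single-agent sketch (the log path and HDFS URL are illustrative) that tails a log file, buffers events in memory, and persists them to HDFS:

# one agent (one JVM) named a1, with one source, one channel, and one sink
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Source: where the events come from (here, tailing a log file)
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /var/log/app/app.log

# Channel: the buffer that holds events until the sink has taken them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: where the events are written (HDFS in this sketch)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# wire the three components together
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1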

Retrieving the timestamp from an HBase row

Submitted by 折月煮酒 on 2020-01-02 03:57:45
Question: Using the HBase API (Get/Put) or the HBQL API, is it possible to retrieve the timestamp of a particular column? Answer 1: Assuming your client is configured and you have a table set up, doing a get returns a Result:
Get get = new Get(Bytes.toBytes("row_key"));
Result result_foo = table.get(get);
A Result is backed by KeyValues, and KeyValues contain the timestamps. You can get either a list of KeyValues with list() or an array with raw(). A KeyValue has a getTimestamp method. result_foo.raw()[0].getTimestamp(

A Hadoop-based BI architecture

Submitted by 感情迁移 on 2019-12-30 01:42:16
A BI system is a typical example of an enterprise using data to drive its operations. By mining the data generated as the business runs, a BI system uncovers potential risks and provides data support for all kinds of business decisions.

Traditional BI systems are usually built on top of relational databases. As business volumes grow and the demand for real-time extraction and analysis of user behavior keeps rising, the traditional BI architecture can no longer satisfy real-time analysis or very large data volumes, so a new data-analysis solution is called for.

Thanks to Hadoop's strengths in big data and distributed computing, as well as its rich set of components, building a BI architecture on Hadoop is much more convenient. A typical Hadoop-based BI architecture is shown in the figure below.

This BI architecture consists of two main parts: a real-time processing part and an offline batch-processing part.

Real-time processing part: its main function is to capture users' website and app access records in real time and analyze their behavior trajectories; the data source is generally access logs.

Data flow: Flume pulls the server logs in real time and sends them to both Spark and Hadoop. Spark receives the log data sent by Flume through the Spark Structured Streaming component and computes over it within a given window and trigger interval, extracting the users' basic behavior at that point in time and storing the results in HBase. This stage involves high-frequency reads, writes, and computation; the Flume and Spark pieces in particular need a fairly large amount of memory, so hardware capacity must be planned carefully. At the same time, a copy of the log data is written to Hadoop, mainly to support offline analysis.

Offline batch-processing part: its main function is to analyze business data (such as purchase, sales, and inventory data, etc.
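On the Flume side, the "send to both Spark and Hadoop" step described above is typically a fan-out: one source replicated into two channels, one feeding an Avro sink that the streaming job listens to, and one feeding an HDFS sink for the offline copy. The sketch below illustrates this; hostnames, ports, and paths are placeholders, and how the Spark job actually consumes the Avro stream is outside the Flume configuration:

agent.sources = r1
agent.channels = c-stream c-batch
agent.sinks = k-stream k-hdfs

# one source, replicated into both channels
agent.sources.r1.type = exec
agent.sources.r1.command = tail -F /var/log/nginx/access.log
agent.sources.r1.selector.type = replicating
agent.sources.r1.channels = c-stream c-batch

agent.channels.c-stream.type = memory
agent.channels.c-batch.type = file
agent.channels.c-batch.checkpointDir = /var/flume/checkpoint
agent.channels.c-batch.dataDirs = /var/flume/data

# real-time branch: hand events to the streaming job over Avro
agent.sinks.k-stream.type = avro
agent.sinks.k-stream.hostname = spark-receiver.example.com
agent.sinks.k-stream.port = 9988
agent.sinks.k-stream.channel = c-stream

# offline branch: keep a raw copy in HDFS for batch analysis
agent.sinks.k-hdfs.type = hdfs
agent.sinks.k-hdfs.hdfs.path = hdfs://namenode:8020/logs/%Y%m%d
agent.sinks.k-hdfs.hdfs.useLocalTimeStamp = true
agent.sinks.k-hdfs.channel = c-batch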

Flume hands-on practice

Submitted by 混江龙づ霸主 on 2019-12-26 01:08:29
Prerequisites: understand the Flume architecture and its core components.
Flume architecture and core components:
Source: collection (specifies where the data is obtained from)
Channel: aggregation
Sink: output (where the data gets written)
Learning to use Flume: learn Flume through a simple little example. The key to using Flume is writing the configuration file. A configuration file consists of:
A) configuring the Source
B) configuring the Channel
C) configuring the Sink
D) wiring the three components above together
A simple example
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: name of the agent
# r1: name of the source
# k1: name of the sink
# c1: name of the channel
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# type: type of the source component
# bind: host or IP the source binds to
# port: port the source binds to
#
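The excerpt cuts off before steps B-D. For reference, the standard netcat example in the Flume user guide, which this post follows, continues roughly as below, with a logger sink, an in-memory channel, and the wiring that ties the three components together:

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The agent can then be started with flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console and tested by sending lines with telnet localhost 44444.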