flume

How to read data files generated by Flume from Twitter

Posted by 时间秒杀一切 on 2019-12-11 23:24:03

Question: I have generated a few Twitter data log files using Flume on HDFS. What is the actual format of the log files? I was expecting data in JSON format, but it looks like this. Could someone help me with how to read this data, or point out what is wrong with the way I have done this?

Answer 1: Download the file (hive-serdes-1.0-SNAPSHOT.jar) from this link: http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar. Then put this file in your $HIVE_HOME/lib and add the jar in the Hive shell:

    hive> ADD JAR file:///home
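Once the SerDe jar is registered, the usual next step (following Cloudera's Twitter-analysis example) is to create an external Hive table over the directory where Flume writes the tweets. A minimal sketch; the jar path, column list, and HDFS location are assumptions to adapt, though the SerDe class name is the one shipped in hive-serdes:

    # register the SerDe and map a table onto the Flume output directory
    hive -e "
    ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
    CREATE EXTERNAL TABLE tweets (
      id BIGINT,
      created_at STRING,
      text STRING
    )
    ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
    LOCATION '/user/flume/tweets';
    "

A quick SELECT afterwards confirms whether the JSON fields map onto the columns.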

Pushing Router Data to a Distributed Messaging System

Posted by 别说谁变了你拦得住时间么 on 2019-12-11 19:01:22

Question: Goal: making an interface of a router the producer for a Kafka cluster. Issue: my router's interface is trying to push data to the port on which Kafka is running (9092 by default).
Q1: Can the Kafka broker accept this data without a topic being created?
Q2: Can a Kafka consumer pull data without specifying a topic? If yes, how? If not, what is the workaround and how can I achieve this?
1st edit: I just checked that the Kafka broker configs have an "auto.create.topics.enable" field.
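On Q1: with auto.create.topics.enable=true (the default in classic Kafka releases), the broker creates a topic the first time a producer writes to it, but every record still lands in some topic. On Q2: a consumer always reads from named topics (or a topic pattern), so the usual approach is to agree on a topic name up front. A sketch with the stock shell tools from that era; the localhost addresses and the topic name router-data are placeholders:

    # create the topic explicitly instead of relying on auto-creation
    kafka-topics.sh --create --zookeeper localhost:2181 \
      --replication-factor 1 --partitions 1 --topic router-data

    # read everything the router has pushed so far
    kafka-console-consumer.sh --zookeeper localhost:2181 \
      --topic router-data --from-beginning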

Writing data into Flume and then to HDFS

Posted by 心不动则不痛 on 2019-12-11 17:52:06

Question: I am using Flume 1.5.0.1 and Hadoop 2.4.1, trying to put a string into Flume and save it to HDFS. The Flume configuration file is as follows:

    agentMe.channels = memory-channel
    agentMe.sources = my-source AvroSource
    agentMe.sinks = log-sink hdfs-sink

    agentMe.sources.AvroSource.channels = memory-channel
    agentMe.sources.AvroSource.type = avro
    agentMe.sources.AvroSource.bind = 0.0.0.0  # i tried client ip as well
    agentMe.sources.AvroSource.port = 41414

    agentMe.channels.memory-channel.type = memory
    agentMe
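The question is cut off before the sink definitions. For reference, a typical completion of the hdfs-sink side looks like the sketch below; the HDFS path and roll settings are assumptions, not the asker's actual values:

    cat >> agentMe.conf <<'EOF'
    # hypothetical completion of the truncated config
    agentMe.channels.memory-channel.capacity = 10000
    agentMe.sinks.hdfs-sink.channel = memory-channel
    agentMe.sinks.hdfs-sink.type = hdfs
    agentMe.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/user/flume/events
    agentMe.sinks.hdfs-sink.hdfs.fileType = DataStream
    agentMe.sinks.hdfs-sink.hdfs.rollCount = 0
    EOF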

Using NGINX to forward tracking data to Flume

Posted by 不打扰是莪最后的温柔 on 2019-12-11 14:44:00

Question: I am working on providing analytics for our web property based on instrumentation data we collect via a simple image beacon. Our data pipeline starts with Flume, and I need the fastest possible way to parse query-string parameters, form a simple text message, and shove it into Flume. For performance reasons, I am leaning towards nginx. Since serving a static image from memory is already supported, my task is reduced to handling the query string and forwarding a message to Flume. Hence, the
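One common way to wire this up (an assumption about the eventual solution, since the question is truncated) is to serve the beacon with nginx's empty_gif module, write the raw query string to a dedicated access log via a custom log_format, and tail that log into Flume with an exec source:

    # nginx side: 1x1 GIF served from memory, query string to its own log
    cat > /etc/nginx/conf.d/beacon.conf <<'EOF'
    log_format beacon '$msec $args';
    server {
        listen 80;
        location = /beacon.gif {
            empty_gif;
            access_log /var/log/nginx/beacon.log beacon;
        }
    }
    EOF

    # Flume side: exec source tailing the beacon log
    cat >> flume.conf <<'EOF'
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/nginx/beacon.log
    EOF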

Data ingestion with Flume & Hadoop doesn't work

Posted by 非 Y 不嫁゛ on 2019-12-11 11:39:22

Question: I'm using Flume 1.4.0 and Hadoop 2.2.0. When I start Flume and write to HDFS, I get the following exception:

    (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:460)] process failed
    java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$RenewLeaseRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1
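This VerifyError is the classic symptom of a protobuf version clash: Flume 1.4.0 bundles protobuf-java 2.4.1, while Hadoop 2.2.0's generated protocol classes are built against protobuf-java 2.5.0, where getUnknownFields() is final. A hedged fix, assuming default install layouts, is to swap the jar in Flume's lib directory:

    # back up Flume's bundled protobuf, then use Hadoop's 2.5.0 jar instead
    mv $FLUME_HOME/lib/protobuf-java-2.4.1.jar /tmp/
    cp $HADOOP_HOME/share/hadoop/common/lib/protobuf-java-2.5.0.jar $FLUME_HOME/lib/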

Flume Interceptors and Monitors

Posted by 浪尽此生 on 2019-12-11 11:37:38

I. Interceptors

1. Interceptors: an interceptor sits between the source and the channel and is used to set header fields on each event; without an interceptor, an event carries only its message body. Common interceptors:
Timestamp Interceptor: inserts a timestamp into the header.
Host Interceptor: inserts the server's IP address or hostname into the header.
Regex Filtering Interceptor: filters out unwanted log lines.
https://blog.csdn.net/jinywum/article/details/82598947

2. Custom interceptors: the main purpose is to classify logs. A custom interceptor sets a header on each event that marks the log's type, so that when the data reaches Kafka the header tells you which type each log belongs to.
Steps to build a custom interceptor (a configuration sketch follows the list):
a. Add the Flume dependency to the project's pom file.
b. Find the existing TimestampInterceptor class, copy its code into your own custom class, and modify it as needed.
c. Package the project as a jar, rename it app_logs_flume.jar, and put it in the /opt/module/flume/lib directory.
d. Specify the interceptor type in the Flume configuration file:
a1.sources.r1.interceptors = i1 a1
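As a concrete illustration of step d, here is a minimal sketch of wiring interceptors into a source. The agent and source names follow the snippet above; the custom interceptor's Builder class name is hypothetical:

    cat >> flume.conf <<'EOF'
    a1.sources.r1.interceptors = i1 i2 i3
    a1.sources.r1.interceptors.i1.type = timestamp
    a1.sources.r1.interceptors.i2.type = host
    # a custom interceptor is referenced by its Builder's fully-qualified name
    a1.sources.r1.interceptors.i3.type = com.example.flume.LogTypeInterceptor$Builder
    EOF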

Configure Flume in a shell/bash script - avoid the interactive Flume shell console

Posted by 主宰稳场 on 2019-12-11 10:31:59

Question: The normal way to configure Flume is via the Flume master web console, which is too easy to need discussion here, OR via the interactive Flume shell console, following the steps below:
1. $ flume shell (this brings you to the interactive Flume shell console)
2. In the interactive console, run "connect flume-master-node" (this connects you to the Flume master)
3. In the interactive console, run "exec unconfig your_node" (this removes all Flume configuration for the node)
4. In the interactive console, run "exec config your_node new
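To drive the same steps from a plain bash script, one approach (an assumption: it relies on the Flume shell reading commands from stdin, as most interactive consoles do) is a heredoc; the source and sink arguments are placeholders:

    # run the console commands non-interactively
    flume shell <<'EOF'
    connect flume-master-node
    exec unconfig your_node
    exec config your_node 'your_source' 'your_sink'
    EOF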

Why does a Flume source need to recognize the format of the message?

Posted by 若如初见. on 2019-12-11 08:48:12

Question: According to the Flume documentation from here: "A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink." Why does a Flume source need to recognize or understand the format of the message? While all it does is
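The "format" in that paragraph refers to the transport contract, not the payload: an Avro source must speak the Avro RPC protocol to deserialize incoming events, but each event's body remains opaque bytes to Flume. The bundled avro-client shows this contract in action; the host, port, and file are placeholders:

    # send a local file to an Avro source as Flume events (bodies = raw bytes)
    flume-ng avro-client -H localhost -p 41414 -F /var/log/app.log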

Flume to read Facebook pages/feeds/posts

Posted by 笑着哭i on 2019-12-11 06:34:52

Question: Does anyone know how to use Flume so that it reads data from a Facebook page? I actually want a Flume agent that reads a specific Facebook page, extracts all the information such as posts/feeds, and pushes the data into Hadoop databases.

Answer 1: As mentioned in "Flume Streaming Data from Facebook", the sentiment_analysis project has an overview containing the following:
1) Sample PHP code for the Facebook HTTP gets and posts
2) Flume configuration for a Facebook HTTP Source
3) The flume agent
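For item 2, an "HTTP Source" setup usually means Flume's built-in http source receiving JSON events that an external script fetches from the Graph API and POSTs to the agent. A sketch; the port, names, and HDFS path are assumptions:

    cat > fb-agent.conf <<'EOF'
    a1.sources = http-src
    a1.channels = mem-ch
    a1.sinks = hdfs-out

    # accepts JSON events POSTed by the external fetcher script
    a1.sources.http-src.type = http
    a1.sources.http-src.port = 5140
    a1.sources.http-src.channels = mem-ch

    a1.channels.mem-ch.type = memory

    a1.sinks.hdfs-out.type = hdfs
    a1.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/user/flume/facebook
    a1.sinks.hdfs-out.channel = mem-ch
    EOF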

How is Flume distributed?

Posted by 白昼怎懂夜的黑 on 2019-12-11 01:05:34

Question: I am working with Flume to ingest a ton of data into HDFS (petabytes of data). I would like to know how Flume makes use of its distributed architecture. I have over 200 servers, and I have installed Flume on one of them, from which I would get the data (aka the data source); the sink is HDFS. (Hadoop is running over Serengeti on these servers.) I am not sure whether Flume distributes itself over the cluster or whether I have installed it incorrectly. I followed Apache's user guide for
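Flume does not distribute itself across a cluster: you run one agent per machine and tier them explicitly. The usual fan-in topology is a leaf agent on each data-producing server whose Avro sink forwards to a collector agent whose Avro source feeds HDFS. A sketch with placeholder names, hosts, and paths:

    # leaf.conf - on each of the ~200 servers: tail a log, forward via Avro
    cat > leaf.conf <<'EOF'
    leaf.sources = tail-src
    leaf.channels = ch
    leaf.sinks = avro-out
    leaf.sources.tail-src.type = exec
    leaf.sources.tail-src.command = tail -F /var/log/app.log
    leaf.sources.tail-src.channels = ch
    leaf.channels.ch.type = memory
    leaf.sinks.avro-out.type = avro
    leaf.sinks.avro-out.hostname = collector-host
    leaf.sinks.avro-out.port = 4545
    leaf.sinks.avro-out.channel = ch
    EOF

    # collector.conf - on the aggregation node: receive Avro, write to HDFS
    cat > collector.conf <<'EOF'
    coll.sources = avro-in
    coll.channels = ch
    coll.sinks = hdfs-out
    coll.sources.avro-in.type = avro
    coll.sources.avro-in.bind = 0.0.0.0
    coll.sources.avro-in.port = 4545
    coll.sources.avro-in.channels = ch
    coll.channels.ch.type = memory
    coll.sinks.hdfs-out.type = hdfs
    coll.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/user/flume/ingest
    coll.sinks.hdfs-out.channel = ch
    EOF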