Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming


Question:

I hit a small problem while trying Cloudera 5.4.2, following this article:

Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm

The tutorial fetches tweets using Flume and the Twitter streaming API for data analysis. Everything goes smoothly: create the Twitter app, create the directory on HDFS, configure Flume, start fetching data, and create a schema on top of the tweets.

Then comes the problem. The Twitter source converts tweets to Avro format and sends Avro events to the downstream HDFS sink, but when the Avro-backed Hive table loads the data, I get the error message "Avro block size is invalid or too large".
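For example, a simple query against the table fails immediately (the table name here is a placeholder; use whatever your schema defines):

hive -e 'SELECT COUNT(*) FROM tweets;'
# fails with: Avro block size is invalid or too large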

So what is an Avro block, and what limits its size? Can I change it? What does this message actually mean? Is the file at fault, or just some records? If the Twitter stream hit bad data, the source should have crashed. And if the tweets were converted to Avro successfully, then, conversely, the Avro data should read back correctly, right?

I also tried inspecting the files with avro-tools-1.7.7.jar.
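For example, pulling one of the Flume output files off HDFS and inspecting it (the file name is a placeholder for an actual FlumeData file):

hadoop fs -get /user/flume/tweets/FlumeData.1234567890123 .
java -jar avro-tools-1.7.7.jar getschema FlumeData.1234567890123
java -jar avro-tools-1.7.7.jar tojson FlumeData.1234567890123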

The same error. I have googled a lot; no answers at all.

Could anyone who has met this problem share a solution? Or could someone who really understands Avro internals or the Twitter streaming source give a clue?

It is a really interesting problem. Think about it.

Answer 1:

Use the Cloudera TwitterSource.

Otherwise you will run into this problem: the Apache TwitterSource converts tweets to Avro before writing them, and those files do not read back cleanly, while the Cloudera TwitterSource emits the raw JSON tweets that the Cloudera tutorial's Hive setup expects. See also:

Unable to correctly load twitter avro data into hive table

The article uses the Apache TwitterSource:

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource

The Flume documentation describes it as: "Twitter 1% Firehose Source. This source is highly experimental. It connects to the 1% sample Twitter Firehose using streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink."

But it should be the Cloudera TwitterSource:

https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/

http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource 
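For context, a minimal agent configuration along the lines of the Cloudera example looks like this (the OAuth keys, keywords, and HDFS path are placeholders to fill in):

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>
TwitterAgent.sources.Twitter.keywords = hadoop, bigdata

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs:///user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

Note that fileType = DataStream and writeFormat = Text simply pass the events through as text, which is what you want here because the Cloudera source emits plain JSON tweets rather than Avro.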

And do not just download the pre-built jar, because our Cloudera version is 5.4.2; otherwise you will get this error:

Cannot run Flume because of JAR conflict

You should compile it yourself using Maven:

https://github.com/cloudera/cdh-twitter-example

Download and compile flume-sources-1.0-SNAPSHOT.jar. This jar contains the implementation of the Cloudera TwitterSource.

Steps:

sudo yum update
sudo yum install apache-maven
wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
unzip master.zip
cd cdh-twitter-example-master/flume-sources
mvn package

Then put the built jar into the Flume plugins directory:

/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar

Notice: run yum update first to bring everything to the latest version; otherwise the compile (mvn package) fails due to a security problem.
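For reference, a sketch of installing the built jar and starting the agent, assuming standard CDH paths (adjust the config file and agent name to your setup):

sudo mkdir -p /var/lib/flume-ng/plugins.d/twitter-streaming/lib/
sudo cp target/flume-sources-1.0-SNAPSHOT.jar /var/lib/flume-ng/plugins.d/twitter-streaming/lib/
flume-ng agent --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/flume.conf --name TwitterAgent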


