Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming


Question:

I hit a small problem while trying Cloudera 5.4.2, following this article:

Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm

The tutorial fetches tweets using Flume and the Twitter streaming API for data analysis. Everything goes smoothly: create the Twitter app, create the directory on HDFS, configure Flume, start fetching data, and create a schema on top of the tweets.

Then comes the problem. The Twitter source converts tweets to Avro format and sends Avro events to the downstream HDFS sink, but when the Avro-backed Hive table loads the data, I get the error message "Avro block size is invalid or too large".
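For example, a simple query against the table fails immediately (the table name here is a placeholder; use whatever your schema defines):

hive -e 'SELECT COUNT(*) FROM tweets;'
# fails with: Avro block size is invalid or too large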

So what is an Avro block, and what limits its size? Can I change it? What does this message actually mean? Is the file at fault, or just some records? If the Twitter stream hit bad data, the source should have crashed. And if the tweets were converted to Avro successfully, then, conversely, the Avro data should read back correctly, right?

I also tried inspecting the files with avro-tools-1.7.7.jar.
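For example, pulling one of the Flume output files off HDFS and inspecting it (the file name is a placeholder for an actual FlumeData file):

hadoop fs -get /user/flume/tweets/FlumeData.1234567890123 .
java -jar avro-tools-1.7.7.jar getschema FlumeData.1234567890123
java -jar avro-tools-1.7.7.jar tojson FlumeData.1234567890123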

The same error. I have googled a lot; no answers at all.

Could anyone who has met this problem share a solution? Or could someone who really understands Avro internals or the Twitter streaming source give a clue?

It is a really interesting problem. Think about it.

Answer 1:

Use the Cloudera TwitterSource.

Otherwise you will run into this problem: the Apache TwitterSource converts tweets to Avro before writing them, and those files do not read back cleanly, while the Cloudera TwitterSource emits the raw JSON tweets that the Cloudera tutorial's Hive setup expects. See also:

Unable to correctly load twitter avro data into hive table

The article uses the Apache TwitterSource:

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource

The Flume documentation describes it as: "Twitter 1% Firehose Source. This source is highly experimental. It connects to the 1% sample Twitter Firehose using streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink."

But it should be the Cloudera TwitterSource:

https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/

http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource 
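For context, a minimal agent configuration along the lines of the Cloudera example looks like this (the OAuth keys, keywords, and HDFS path are placeholders to fill in):

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>
TwitterAgent.sources.Twitter.keywords = hadoop, bigdata

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs:///user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

Note that fileType = DataStream and writeFormat = Text simply pass the events through as text, which is what you want here because the Cloudera source emits plain JSON tweets rather than Avro.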

And do not just download the pre-built jar, because our Cloudera version is 5.4.2; otherwise you will get this error:

Cannot run Flume because of JAR conflict

You should compile it yourself using Maven:

https://github.com/cloudera/cdh-twitter-example

Download and compile flume-sources-1.0-SNAPSHOT.jar. This jar contains the implementation of the Cloudera TwitterSource.

Steps:

sudo yum update
sudo yum install apache-maven
wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
unzip master.zip
cd cdh-twitter-example-master/flume-sources
mvn package

Then put the built jar into the Flume plugins directory:

/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar

Notice: run yum update first to bring everything to the latest version; otherwise the compile (mvn package) fails due to a security problem.
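For reference, a sketch of installing the built jar and starting the agent, assuming standard CDH paths (adjust the config file and agent name to your setup):

sudo mkdir -p /var/lib/flume-ng/plugins.d/twitter-streaming/lib/
sudo cp target/flume-sources-1.0-SNAPSHOT.jar /var/lib/flume-ng/plugins.d/twitter-streaming/lib/
flume-ng agent --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/flume.conf --name TwitterAgent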


