How to load Twitter data from HDFS using Pig?

ⅰ亾dé卋堺 submitted on 2019-12-24 00:33:49

Question


I have just streamed some Twitter data using Flume and stored it in HDFS, and now I am trying to load it into Pig for analysis. Since the default JsonLoader function cannot load this data, I searched Google for a library that can. I found this link and followed its instructions.

Here is the result:

REGISTER '/home/hduser/Downloads/json-simple-1.1.1.jar';

2016-02-22 20:54:46,539 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

The same message appears for the other two commands.

Now, when I try to load my data using this command:

load_tweets = LOAD '/TwitterData/' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;

it shows me this error:

2016-02-22 20:58:01,639 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve com.twitter.elephantbird.pig.load.JsonLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/hduser/pig-0.15.0/pig_1456153061619.log

How can I solve this and load the data properly?

Note: My data consists of tweets about the recently released movie Deadpool.


Answer 1:


You need to register the jar below in Pig; it contains the class you are trying to access.

elephant-bird-pig-4.1.jar

EDIT: Here are the complete steps.

REGISTER '/home/hdfs/json-simple-1.1.jar';

REGISTER '/home/hdfs/elephant-bird-hadoop-compat-4.1.jar';

REGISTER '/home/hdfs/elephant-bird-pig-4.1.jar';

load_tweets = LOAD '/user/hdfs/twittes.txt' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;

dump load_tweets;

I used the steps above on my local cluster and they work fine, so you need to register these jars before running your LOAD.




Answer 2:


You need to register three jar files, as shown in the blog. Each jar has its own purpose:

elephant-bird-hadoop-compat-4.1.jar: utilities for dealing with Hadoop incompatibilities between 1.x and 2.x.

elephant-bird-pig-4.1.jar: JSON loader for Pig; it loads each JSON record into Pig.

json-simple-1.1.1.jar: one of the JSON parsers available in Java.

After registering the jars, you can load the tweets with the following Pig script:

load_tweets = LOAD '/user/flume/tweets/' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;

After loading the tweets, you can view them by dumping the relation:

dump load_tweets;
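
Since the goal is analysis, a next step could be to pull individual fields out of the loaded map. The following is only a minimal sketch, not something taken from the blog: it assumes the three jars are registered, that load_tweets was loaded with JsonLoader('-nestedLoad') as above, and that each tweet's JSON has top-level 'id' and 'text' keys (true of the standard Twitter format, but not verified against this data). The relation names tweet_details and sample_tweets are made up for illustration.

```pig
-- Sketch only: assumes load_tweets was loaded with JsonLoader('-nestedLoad') AS myMap.
-- 'id' and 'text' are assumed top-level keys of each tweet's JSON record.
tweet_details = FOREACH load_tweets GENERATE
    myMap#'id'   AS id,
    myMap#'text' AS text;

-- Look at a small sample rather than dumping the whole data set.
sample_tweets = LIMIT tweet_details 10;
DUMP sample_tweets;
```

If Pig complains about the type of myMap, declaring the schema explicitly in the LOAD statement, for example AS (myMap:map[]), is one thing to try.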


Source: https://stackoverflow.com/questions/35557555/how-to-load-twitter-data-from-hdfs-using-pig
