apache-pig

Remove brackets and commas in output from Pig

Submitted by 孤者浪人 on 2019-12-24 03:57:04
Question: Currently my output is as below:

((130,1)) ((131,1)) ((132,1)) ((133,1)) ((137,1)) ((138,2)) ((139,1)) ((140,1)) ((142,2)) ((143,1))

I want to have it like:

130 1
131 1
132 1

My code is given below:

A = LOAD 'user-links-small.txt' AS (user_a: int, user_b: int);
B = ORDER A BY user_a;
grouped = COGROUP B BY user_a;
C = FOREACH grouped GENERATE COUNT(B);
D = COGROUP C BY $0;
E = FOREACH D GENERATE($0, COUNT($1));
DUMP E;

I was looking through these forums, and some suggested that the way to …
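A common fix, sketched here under the assumption that the goal is simply a per-user count: the extra parentheses in GENERATE($0, COUNT($1)) wrap the fields into a nested tuple, and DUMP always prints tuples with parentheses. Dropping the parentheses and writing the result with STORE and a space delimiter produces the plain "130 1" format:

```pig
-- Load and count links per user (same schema as in the question).
A = LOAD 'user-links-small.txt' AS (user_a:int, user_b:int);
grouped = GROUP A BY user_a;
-- No parentheses around the projection, so the output is a flat
-- two-field tuple rather than a tuple nested inside another tuple.
E = FOREACH grouped GENERATE group, COUNT(A);
-- STORE writes plain delimited text; DUMP always adds parentheses.
STORE E INTO 'output' USING PigStorage(' ');
```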

pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

Submitted by 醉酒当歌 on 2019-12-24 03:53:17
Question: Suppose I have the following flat file on HDFS (let's call this key_value):

1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing

Here is the output I'm looking for:

(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)

In other words, the first two columns form a unique identifier (similar to a composite key in a db…
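A partial sketch of one approach (field names are assumptions): group on the composite key and flatten each group's bag sorted by attribute name. This gets the rows-to-columns transposition, but it does not yet insert placeholders for missing attributes; that last step would still need a UDF or a join against the full attribute list.

```pig
-- Load the key/value file (column names are assumptions).
kv = LOAD 'key_value' USING PigStorage(',')
     AS (id1:int, id2:int, attr:chararray, val:chararray);

-- Group on the composite key; each group's bag holds its attribute pairs.
grp = GROUP kv BY (id1, id2);

-- Flatten the bag sorted by attribute name. Rows with missing
-- attributes simply produce fewer fields (no placeholders yet).
pivoted = FOREACH grp {
    sorted = ORDER kv BY attr;
    GENERATE FLATTEN(group) AS (id1, id2),
             FLATTEN(BagToTuple(sorted.(attr, val)));
};
DUMP pivoted;
```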

How to use .jar in a pig file

Submitted by 安稳与你 on 2019-12-24 02:38:05
Question: I have two input files, smt.txt and smo.txt. The jar file reads the text files and splits the data according to a rule described in a Java file, and the Pig script takes this data and writes it to output files via MapReduce:

register 'maprfs:///user/username/fl.jar';
DEFINE FixedLoader fl();
mt = load 'maprfs:///user/username/smt.txt' using FixedLoader('-30','30-33',...) AS (...);
mo = load 'maprfs:///user/username/smo.txt*' using FixedLoader('-30','30-33',...) AS (...…
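For reference, the usual shape of this pattern, sketched with a hypothetical loader class name and placeholder fields (the question's class name, column ranges, and schema are elided): DEFINE binds the fully qualified class plus its constructor arguments to a short alias, which is then used in USING without arguments.

```pig
-- Register the jar containing the custom loader, then alias it.
-- com.example.FixedLoader and the field list are assumptions.
REGISTER 'maprfs:///user/username/fl.jar';
DEFINE FixedLoader com.example.FixedLoader('-30','30-33');

-- Load both files with the custom loader; the alias carries its args.
mt = LOAD 'maprfs:///user/username/smt.txt' USING FixedLoader AS (f1:chararray, f2:chararray);
mo = LOAD 'maprfs:///user/username/smo.txt' USING FixedLoader AS (f1:chararray, f2:chararray);

-- Combine both relations before storing the result.
both = UNION mt, mo;
STORE both INTO 'output' USING PigStorage(',');
```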

elephantbird registered still showing error 2998

Submitted by a 夏天 on 2019-12-24 01:57:05
Question:

grunt> register '/home/piyush/Desktop/pro/json-simple-1.1.1.jar'
grunt> register '/home/piyush/Desktop/pro/elephant-bird-pig-4.1.jar'
grunt> register '/home/piyush/Desktop/pro/elephant-bird-hadoop-compat-4.1.jar'
grunt> register '/home/piyush/Desktop/pro/elephant-bird-core-4.1.jar'
grunt> load_tweets = LOAD '/home/piyush/Desktop/pro/quattr.txt' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
2017-01-26 07:16:29,631 [main] ERROR org.apache.pig.tools.grunt.Grunt - …
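ERROR 2998 is Pig's generic "unhandled internal error", and with elephant-bird it usually points to a class missing at parse time, so checking the stack trace for a NoClassDefFoundError against the registered jars is the first step. One other thing worth checking, sketched below with the same jars and paths: giving the loaded map an explicit map type in the AS clause rather than a bare alias (whether that is the actual cause here cannot be confirmed from the truncated log).

```pig
REGISTER '/home/piyush/Desktop/pro/json-simple-1.1.1.jar';
REGISTER '/home/piyush/Desktop/pro/elephant-bird-core-4.1.jar';
REGISTER '/home/piyush/Desktop/pro/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/home/piyush/Desktop/pro/elephant-bird-pig-4.1.jar';

-- Declare the map type explicitly instead of a bare alias.
load_tweets = LOAD '/home/piyush/Desktop/pro/quattr.txt'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
    AS (myMap:map[]);
DUMP load_tweets;
```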

Pig - how to iterate on a bag of maps

Submitted by 时光怂恿深爱的人放手 on 2019-12-24 01:55:08
Question: Let me explain the problem. I have this line of code:

u = FOREACH persons GENERATE FLATTEN($0#'experiences') as j;
dump u;

which produces this output:

([id#1,date_begin#12 2012,description#blabla,date_end#04 2013],[id#2,date_begin#02 2011,description#blabla2,date_end#04 2013])
([id#1,date_begin#12 2011,description#blabla3,date_end#04 2012],[id#2,date_begin#02 2010,description#blabla4,date_end#04 2011])

Then, when I do this:

p = foreach u generate j#'id', j#'description';
dump p;

I have this…
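A sketch of the usual fix, assuming 'experiences' is a bag of maps as the output suggests: the dump above shows two maps per row, meaning the FLATTEN produced a multi-field tuple rather than one map per row, so j only names the first field. Declaring the flattened element as a map and letting FLATTEN emit one row per map makes the # dereference behave as expected:

```pig
-- One map per output row: flatten the bag and type each element as map[].
u = FOREACH persons GENERATE FLATTEN($0#'experiences') AS j:map[];

-- Now j is a single map, so # dereferencing returns the expected values.
p = FOREACH u GENERATE j#'id' AS id, j#'description' AS description;
DUMP p;
```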

PIG: How to remove '::' in the column name

Submitted by 若如初见. on 2019-12-24 00:58:27
Question: I have a Pig relation like the one below:

FINAL = {input_md5::type: chararray, input_md5::name: chararray, input_md5::id: long, input_md5::age: chararray, test_1::type: chararray, test_2::name: chararray}

I am trying to store all of the input_md5 columns into a Hive table: that is, all of input_md5::type, input_md5::name, input_md5::id, and input_md5::age, while not taking test_1::type and test_2::name. Is there any command in Pig which filters only the columns of input…
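A minimal sketch of the standard approach: project just the input_md5 columns in a FOREACH and use AS to rename them, which strips the :: disambiguation prefix. The HCatStorer line assumes HCatalog is the bridge to Hive, and 'hive_table' is a placeholder name:

```pig
-- Project only the input_md5 columns; AS drops the :: prefix.
CLEANED = FOREACH FINAL GENERATE
    input_md5::type AS type,
    input_md5::name AS name,
    input_md5::id   AS id,
    input_md5::age  AS age;

-- Write to Hive via HCatalog (table name is a placeholder).
STORE CLEANED INTO 'hive_table' USING org.apache.hive.hcatalog.pig.HCatStorer();
```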

how to load twitter data from hdfs using pig?

Submitted by ⅰ亾dé卋堺 on 2019-12-24 00:33:49
Question: I streamed some Twitter data using Flume into HDFS, and now I am trying to load it into Pig for analysis. Since the default JsonLoader function cannot load the data, I searched Google for a library that can load this kind of data, found this link, and followed its instructions. Here is the result:

REGISTER '/home/hduser/Downloads/json-simple-1.1.1.jar';
2016-02-22 20:54:46,539 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, …
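For context, the usual elephant-bird pattern for Flume-delivered tweets looks like the sketch below. The jar versions, download directory, and HDFS path are all assumptions; the key points are registering elephant-bird's core, pig, and hadoop-compat jars alongside json-simple, and loading with the nested JsonLoader so tweet fields come back as a map:

```pig
-- Jar paths and versions are assumptions based on a typical setup.
REGISTER '/home/hduser/Downloads/json-simple-1.1.1.jar';
REGISTER '/home/hduser/Downloads/elephant-bird-core-4.1.jar';
REGISTER '/home/hduser/Downloads/elephant-bird-pig-4.1.jar';
REGISTER '/home/hduser/Downloads/elephant-bird-hadoop-compat-4.1.jar';

-- HDFS path is a placeholder for wherever Flume wrote the tweets.
tweets = LOAD '/user/flume/tweets'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
    AS (json:map[]);

-- Pull individual tweet fields out of the map.
texts = FOREACH tweets GENERATE json#'id' AS id, json#'text' AS text;
DUMP texts;
```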

Merge two lines in Pig

Submitted by 老子叫甜甜 on 2019-12-23 22:12:37
Question: I would like to write a Pig script for the query below.

Input:

ABC,DEF,,
,,GHI,JKL
MNO,PQR,,
,,STU,VWX

Output should be:

ABC,DEF,GHI,JKL
MNO,PQR,STU,VWX

Could anyone please help me?

Answer 1: It will be difficult to solve this problem using native Pig. One option could be to download the datafu-1.2.0.jar library and try the approach below.

input.txt:

ABC,DEF,,
,,GHI,JKL
MNO,PQR,,
,,STU,VWX

Pig script:

REGISTER /tmp/datafu-1.2.0.jar;
DEFINE BagSplit datafu.pig.bags.BagSplit();
A = LOAD 'input.txt' USING …
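Since the answer's BagSplit script is cut off, here is an alternative sketch that stays in native Pig (0.11+, which provides RANK): pair consecutive rows by rank and collapse each pair, relying on MAX to prefer the non-empty value in each column. This assumes the file is exactly two rows per record with empty fields as shown.

```pig
A = LOAD 'input.txt' USING PigStorage(',')
    AS (c1:chararray, c2:chararray, c3:chararray, c4:chararray);

-- Tag each row with an increasing id so consecutive rows can be paired.
B = RANK A;

-- Rows 1 and 2 share pair_id 0, rows 3 and 4 share pair_id 1, etc.
C = FOREACH B GENERATE (rank_A - 1) / 2 AS pair_id, c1, c2, c3, c4;
D = GROUP C BY pair_id;

-- Within each pair, MAX keeps the non-empty value of every column:
-- c1,c2 come from the first row and c3,c4 from the second.
E = FOREACH D GENERATE
    MAX(C.c1) AS c1, MAX(C.c2) AS c2, MAX(C.c3) AS c3, MAX(C.c4) AS c4;
DUMP E;
```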

Parse Complex JSON String in Pig

Submitted by 吃可爱长大的小学妹 on 2019-12-23 19:28:41
Question: I want to parse a string of complex JSON in Pig. Specifically, I want Pig to understand my JSON array as a bag instead of as a single chararray. When using JsonLoader, I can do this easily by specifying the schema, as in this question. Is there any way to either have Pig figure out my schema for me, or to specify it when Pig is parsing a string? I've been using JsonStringToMap, but I can't find a way to specify a schema, or to have it properly understand that my JSON array is an array and not a single…
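For comparison, the schema-at-load approach the question already mentions can be sketched like this with Pig's built-in JsonLoader; the field names here are hypothetical. The array is declared as a bag of tuples in the schema string, which is exactly what makes Pig treat it as a bag rather than a chararray:

```pig
-- Declare the JSON array as a bag of tuples in the loader schema.
-- 'input.json' and the field names are assumptions for illustration.
data = LOAD 'input.json'
    USING JsonLoader('id:int, tags:{t:(tag:chararray)}');

-- The bag can now be flattened like any other Pig bag.
flat = FOREACH data GENERATE id, FLATTEN(tags);
DUMP flat;
```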

How do I read in a list of bags in Pig?

Submitted by 拥有回忆 on 2019-12-23 19:06:14
Question: How do I read in a list of bags in Pig? I tried:

grunt> cat sample.txt
{a,b},{},{c,d}
grunt> data = LOAD 'sample.txt' AS (a:bag{}, b:bag{}, c:bag{});
grunt> DUMP data
({},,)

Answer 1: The default load function in Pig is PigStorage('\t'), which assumes your data is tab-separated; yours is comma-separated, so you should write LOAD 'sample.txt' USING PigStorage(',') AS .... However, your data is not in proper Pig bag format. Remember that a bag is a collection of tuples. If you…
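Completing that answer's reasoning as a sketch: the elements inside each bag must be tuples, and the delimiter between bags must be one that cannot appear inside them, since PigStorage splits on it naively and the bags themselves contain commas. With the file rewritten tab-separated and the tuples parenthesized, declared bag schemas load cleanly:

```pig
-- sample.txt rewritten as tab-separated bags of one-field tuples:
-- {(a),(b)}	{}	{(c),(d)}
data = LOAD 'sample.txt' USING PigStorage('\t')
    AS (a:bag{t:(x:chararray)},
        b:bag{t:(x:chararray)},
        c:bag{t:(x:chararray)});
DUMP data;
```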