apache-pig

Join Multiple Relations by Different Fields

Question: Say I have three files, data1, data2, and assoc:

$ cat data1
key1,foo
key2,bar
$ cat data2
key3,braz
key4,froz
$ cat assoc
key1,key3
key2,key4

I load these files via:

$ pig -b -p debug=WARN -x local
Warning: $HADOOP_HOME is deprecated.
Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
Logging error messages to: /home/vince/tmp/pig_1355407390166.log
Connecting to hadoop file system at: file:///
grunt> data1 = load 'data1' using PigStorage(',') as (key: chararray, val: …
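For reference, joining an association table to two data files on different fields is usually just two JOINs in sequence. A minimal sketch, assuming the schemas above (relation and field names such as left_key/right_key are illustrative):

data1 = LOAD 'data1' USING PigStorage(',') AS (key:chararray, val:chararray);
data2 = LOAD 'data2' USING PigStorage(',') AS (key:chararray, val:chararray);
assoc = LOAD 'assoc' USING PigStorage(',') AS (left_key:chararray, right_key:chararray);

-- Join on the first key, then join the result on the second key.
j1 = JOIN assoc BY left_key, data1 BY key;
j2 = JOIN j1 BY assoc::right_key, data2 BY key;

result = FOREACH j2 GENERATE assoc::left_key, data1::val, assoc::right_key, data2::val;
DUMP result;
-- Expected: (key1,foo,key3,braz) and (key2,bar,key4,froz)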

Pig's Stream Through PHP

Question: I have a Pig script, currently running in local mode, that processes a huge file containing a list of categories:

/root/level1/level2/level3
/root/level1/level2/level3/level4
...

I need to insert each of these into an existing database by calling a stored procedure. Because I'm new to Pig and the UDF interface is a little daunting, I'm trying to get something done by streaming the file's content through a PHP script. I'm finding that the PHP script only sees half of the category lines I'm …
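For context, the Pig side of streaming typically looks like the sketch below; the script and relation names are placeholders. One common pitfall with STREAM is a command that stops reading stdin before EOF, or buffers stdout without flushing, which can look like records going missing:

-- Ship the script to the task nodes and pipe each record through it.
DEFINE php_insert `php insert_categories.php` SHIP('insert_categories.php');
categories = LOAD 'categories.txt' AS (line:chararray);
inserted   = STREAM categories THROUGH php_insert AS (line:chararray);
DUMP inserted;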

Is it possible to cross-join a row in a relation with a tuple in that row in Pig?

Question: I have a set of data that shows users, collections of fruit they like, and their home city:

Alice\tApple:Orange\tSacramento
Bob\tApple\tSan Diego
Charlie\tApple:Pineapple\tSacramento

I would like to create a Pig query that correlates the number of users who enjoy types of fruit across different cities, where the results from the query for the data above would look like this:

Apple\tSacramento\t2
Apple\tSan Diego\t1
Orange\tSacramento\t1
Pineapple\tSacramento\t1

The part I can't figure out is how to …
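For reference, the usual pattern for this shape of problem is to split the colon-separated fruit list into a bag, FLATTEN it to get one row per (fruit, city) pair, then GROUP and COUNT. A minimal sketch, assuming tab-delimited input as shown (TOKENIZE with a custom delimiter needs Pig 0.11+; STRSPLIT is an alternative on older versions):

users = LOAD 'users.tsv' USING PigStorage('\t')
        AS (name:chararray, fruits:chararray, city:chararray);
-- One row per (fruit, city): split the fruit list on ':' and unroll the bag.
pairs  = FOREACH users GENERATE FLATTEN(TOKENIZE(fruits, ':')) AS fruit, city;
byfc   = GROUP pairs BY (fruit, city);
counts = FOREACH byfc GENERATE FLATTEN(group) AS (fruit, city), COUNT(pairs) AS n;
DUMP counts;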

Avoid exception in ToDate in Pig for individual rows

Question: I have an input CSV file which I am trying to process with Pig. In the CSV there is a date column which contains corrupt values in some rows. Please suggest a mechanism to filter out the rows with a corrupt date column before I apply the ToDate() function to that column in a FOREACH...GENERATE statement. A sample of my data:

A,21,12/1/2010 8:26
B,33,12/1/2010 8:26
C,42,i am corrupted
D,30,12/1/2013 9:26

I want to be able to load this and then …
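One straightforward approach is to load the column as a chararray, FILTER on a regular expression describing the expected date shape, and only then call ToDate(). A minimal sketch, assuming the M/d/yyyy H:mm format shown above (file and field names are illustrative):

raw   = LOAD 'input.csv' USING PigStorage(',') AS (id:chararray, num:int, dt:chararray);
-- Pig's matches operator must cover the whole string, so the pattern
-- describes the complete M/d/yyyy H:mm value.
clean = FILTER raw BY dt matches '\\d{1,2}/\\d{1,2}/\\d{4} \\d{1,2}:\\d{2}';
typed = FOREACH clean GENERATE id, num, ToDate(dt, 'M/d/yyyy H:mm') AS dt;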

How to match ',' in PIG?

Question: The Pig script below counts the occurrences of each character in a file. It works for every character except ','. My code:

A = load 'a.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '(.+)';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';

This matches all characters except ',' and gives an output. Input: (cat a.txt) …
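The likely culprit is step B: TOKENIZE's documented default delimiters are whitespace, double quote, comma, parentheses, and '*', so every comma is consumed as a delimiter before the per-character split in D ever sees one. A sketch that splits the raw line into characters directly, preserving commas (the '|' separator trick mirrors the original script; paths and aliases are illustrative):

A = load 'a.txt' using TextLoader() as (line:chararray);
-- Insert '|' after every character, then tokenize on '|' alone,
-- so no character is silently treated as a delimiter.
D = foreach A generate flatten(TOKENIZE(REPLACE(line, '(.)', '$1|'), '|')) as letter;
-- Optionally drop whitespace "characters", matching the original filter.
D2 = filter D by not (letter matches '\\s');
E = group D2 by letter;
F = foreach E generate COUNT(D2), group;
store F into 'pigfiles/charcount';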

STRSPLIT and REGEX_EXTRACT_ALL in PigLatin

Question: I have the following file:

File
----
12-3    John    121
5-1     Sam     122

The file is tab (\t) delimited. I am loading each row as line:chararray because I don't want the data split into individual fields. Now I want to pull out and store the leading details (12-3 and 5-1) separately. I am trying STRSPLIT and REGEX_EXTRACT_ALL, but the data doesn't seem to match:

splitdata = FOREACH filedata {
    regex = REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*)');
    split = STRSPLIT(line, '\\t', 1);
    GENERATE regex, …
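A likely cause: REGEX_EXTRACT_ALL requires its pattern to match the entire input string, so a prefix-only pattern such as '^([0-9]*)\-([0-9]*)' returns null; appending a catch-all for the rest of the line fixes that. Note also that PigStorage's default delimiter is tab, so loading a tab-delimited file AS (line:chararray) keeps only the first field; TextLoader preserves the whole line. A minimal sketch:

filedata  = LOAD 'file.txt' USING TextLoader() AS (line:chararray);
-- The trailing '.*' makes the pattern cover the whole line,
-- as REGEX_EXTRACT_ALL requires.
splitdata = FOREACH filedata GENERATE
            FLATTEN(REGEX_EXTRACT_ALL(line, '([0-9]+)-([0-9]+)\\t.*'))
            AS (num1:chararray, num2:chararray);
DUMP splitdata;
-- Expected: (12,3) and (5,1)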

how to combine/concat two bags in pig latin

Question: I have two datasets:

A = {uid, url};
B = {uid, url};

Now I do a cogroup:

C = COGROUP A BY uid, B BY uid;

and I want to turn C into { group AS uid, DISTINCT A.url+B.url }. My question is: how do I concatenate the two bags A.url and B.url? Or, to put it differently, how do I do DISTINCT on multiple columns?

Answer 1: This may not be what you're expecting, but here is what I understood from your question:

C = JOIN A BY uid, B BY uid;
D = DISTINCT C;

Concatenation is done the following way:

E = …
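An alternative that sidesteps bag concatenation entirely is to UNION the two relations first, de-duplicate, and then group; core Pig has no built-in operator for concatenating two bags (DataFu's BagConcat UDF is one external option). A minimal sketch, assuming both relations share the (uid, url) schema:

A = LOAD 'a.tsv' AS (uid:chararray, url:chararray);
B = LOAD 'b.tsv' AS (uid:chararray, url:chararray);
U = UNION A, B;
-- DISTINCT on the full (uid, url) tuple is effectively
-- DISTINCT over multiple columns.
D = DISTINCT U;
G = GROUP D BY uid;
R = FOREACH G GENERATE group AS uid, D.url AS urls;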

Successful task generates mapreduce.counters.LimitExceededException when trying to commit

Question: I have a Pig script running in MapReduce mode that hits a persistent error I've been unable to fix. The script spawns multiple MapReduce applications; after running for several hours, one of the applications registers as SUCCEEDED but returns the following diagnostic message:

We crashed after successfully committing. Recovering.

The step that causes the failure is a RANK over a dataset of around 100 GB, split across roughly 1000 MapReduce output files …
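One plausible connection between RANK and a counters exception: Pig's RANK implementation uses one Hadoop counter per task to compute global rank offsets, so a job with roughly 1000 tasks can exceed the default counter limit (120 in stock Hadoop 2). A commonly suggested mitigation, hedged because the limit is also enforced cluster-side (a client-only setting may not take effect), is to raise it:

-- In the Pig script; the value is illustrative. The same property
-- usually has to be raised in the cluster's mapred-site.xml as well.
set mapreduce.job.counters.max '5000';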

Use Hadoop Pig to load data from text file w/ each record on multiple lines?

Question: I have my data file in the following format:

U: john
T: 2011-03-03 12:12:12
L: san diego, CA

U: john
T: 2011-03-03 12:12:12
L: san diego, CA

What's the best way to read this file with Hadoop/Pig/whatever for analysis?

Answer 1: Is there any way you can control how the data is being written? Writing a process that converts this to tab-separated records would let you handle it out of the box. Otherwise, writing a custom record reader (in Pig or Java MapReduce) might be your only option. Neither is very hard.
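Once the records are flattened to one line each, the Pig side is trivial. A minimal sketch, assuming a preprocessing step has already joined each U/T/L group into a single tab-separated line (the file name, schema, and example analysis are illustrative):

records = LOAD 'flattened.tsv' USING PigStorage('\t')
          AS (user:chararray, ts:chararray, location:chararray);
-- Example analysis: number of records per location.
by_loc = GROUP records BY location;
counts = FOREACH by_loc GENERATE group AS location, COUNT(records) AS n;
DUMP counts;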

Store data after decompression in pig

Question: The format of my file is:

({"food":"Tacos", "person":"Alice", "amount":3})
({"food":"Tomato Soup", "person":"Sarah", "amount":2})
({"food":"Grilled Cheese", "person":"Alex", "amount":5})

I tried to store this using the following code:

STORE STOCK_A INTO 'default.ash_json_pigtest' USING HCatStorer();

The data was stored as shown below:

{"food":"Tacos", "person":"Alice", "amount":3}    None    None
{"food":"Tomato Soup", "person":"Sarah", "amount":2}    None    None
{"food":"Grilled Cheese", "person":"Alex", "amount" …
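The trailing None columns suggest STOCK_A holds each record as a single chararray (the raw JSON text), so HCatStorer can only populate the first column of the Hive table. One workaround, sketched under the assumption that the target table is (food string, person string, amount int), is to extract the fields before storing; if the underlying file is plain JSON per line, loading it with Pig's JsonLoader and an explicit schema would be the cleaner route:

-- Field extraction as a workaround; the regexes assume the exact
-- JSON shape shown above.
PARSED = FOREACH STOCK_A GENERATE
    REGEX_EXTRACT($0, '"food":"([^"]*)"', 1)       AS food,
    REGEX_EXTRACT($0, '"person":"([^"]*)"', 1)     AS person,
    (int)REGEX_EXTRACT($0, '"amount":([0-9]+)', 1) AS amount;
STORE PARSED INTO 'default.ash_json_pigtest' USING HCatStorer();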