apache-pig

pig how to filter distinct couples (pairs)

醉酒当歌 posted on 2019-12-11 03:15:02
Question: I am new to Pig. I have a Pig script that generates tab-separated pairs of elements, one pair per line. For example:
    John Paul
    Tom Nik
    Mark Bill
    Tom Nik
    Paul John
I need to filter out duplicate combinations. If I use DISTINCT, the duplicate "Tom Nik" entry is removed, and the result is:
    John Paul
    Tom Nik
    Mark Bill
    Paul John
The problem with this approach is that I am left with both "John Paul" and "Paul John", which for my purposes should be treated as the same combination. Is
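One common way to treat "John Paul" and "Paul John" as the same combination is to put the two elements of each pair into a fixed order before removing duplicates. The following is a minimal Java sketch of that idea only, not the answer from the original thread; the class and method names (`UnorderedPairs`, `canonical`, `distinctPairs`) are hypothetical, and it assumes every line contains exactly one tab. In Pig itself, the same ordering step could presumably be written with a conditional expression in a FOREACH before applying DISTINCT.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class UnorderedPairs {
    // Put the two elements of a pair in lexicographic order, so that
    // "Paul\tJohn" and "John\tPaul" produce the same string.
    // Assumption: the line contains exactly one tab separator.
    public static String canonical(String line) {
        String[] parts = line.split("\t", 2);
        String a = parts[0];
        String b = parts[1];
        return a.compareTo(b) <= 0 ? a + "\t" + b : b + "\t" + a;
    }

    // Collect the canonical form of every pair; the set drops duplicates
    // and LinkedHashSet keeps first-seen order.
    public static Set<String> distinctPairs(List<String> lines) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : lines) {
            seen.add(canonical(line));
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "John\tPaul", "Tom\tNik", "Mark\tBill", "Tom\tNik", "Paul\tJohn");
        for (String pair : distinctPairs(input)) {
            System.out.println(pair);
        }
    }
}
```

Note that this approach rewrites each pair into sorted order, so "Tom Nik" comes out as "Nik Tom". If the original orientation of the first occurrence must be kept, the canonical form would have to serve only as a lookup key.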

Java or Pig regex to strip out values from UserAgent string

被刻印的时光 ゝ posted on 2019-12-11 02:39:47
Question: I need to strip out the third and subsequent values in the "bracketed" component of the user agent string. In order to get
    Mozilla/4.0 (compatible; MSIE 8.0)
from
    Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)
I successfully use the sed command
    sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/'
I need to get the same result in Apache Pig with a Java regex. Could anybody
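The sed expression above translates fairly directly into `java.util.regex` syntax: in Java, literal parentheses are escaped and capture groups are not, which is the reverse of basic sed syntax. The sketch below is one possible translation, not the answer from the original thread; the class name `UserAgentTrim` and method name `trim` are hypothetical. Pig's built-in REPLACE function is documented as taking a Java regular expression, so the same pattern should carry over, though the backslash escaping inside a Pig string literal may need adjusting.

```java
import java.util.regex.Pattern;

public class UserAgentTrim {
    // \( ... \)        literal parentheses around the bracketed component
    // ([^;)]+; [^;)]+) capture the first two semicolon-separated values
    // [^)]*            skip everything else up to the closing parenthesis
    private static final Pattern BRACKETED =
            Pattern.compile("\\(([^;)]+; [^;)]+)[^)]*\\)");

    // Keep only the first two values of the first bracketed component.
    // replaceFirst mirrors sed's s/// without the g flag.
    public static String trim(String userAgent) {
        return BRACKETED.matcher(userAgent).replaceFirst("($1)");
    }

    public static void main(String[] args) {
        String ua = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; "
                + "Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; "
                + "Media Center PC 5.0; .NET CLR 3.5.30729; "
                + "WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)";
        System.out.println(trim(ua));
    }
}
```

A user agent whose bracketed component has fewer than two values does not match the pattern and is returned unchanged.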

Pig ERROR 2998: Unhandled internal error. Static (wrong name: com/company/Static)

柔情痞子 posted on 2019-12-11 02:21:14
Question: I have a Pig script that returns a constant string value. When I try to run the script with the following command, I get Pig ERROR 2998:
    pig -Dpig.additional.jars=Static.jar -f script.pig -l /dev/null -x local
script.pig:
    loaded = LOAD 'data/' USING com.twitter.elephantbird.pig.store.LzoPigStorage() AS (request);
    loaded = SAMPLE loaded 0.00001;
    sized = FOREACH loaded GENERATE Static(request);
    DUMP sized;
What's causing the error?
Answer 1: It appears to be a java.lang.NoClassDefFoundError error

Run a String through Java using Pig

怎甘沉沦 posted on 2019-12-11 02:18:09
Question: I have a UDF jar which takes a String as input through Pig. The Java code works fine through Pig when run with a hard-coded string, as in this command:
    B = foreach f generate URL_UDF.mathUDF('stack.overflow');
This gives me the output I expect. My question is: I am trying to get information from a text file and use my UDF with it. I load a file and want to pass data from that loaded file to the UDF.
    LoadData = load 'data.csv' using PigStorage(',');
    f = foreach LoadData

Multi-line JSON read using Apache PIG

谁说胖子不能爱 posted on 2019-12-11 02:06:26
Question: I have a JSON file and want to read it using Apache Pig. I tried the regular JSONLOADER, but it looks like JSONLOADER works only with single-line JSON. Then I tried Elephant-Bird, but I am still not able to see the results correctly. Can anyone please suggest a solution?
Input:
    {"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
    ]}
Note: I don't want to convert the input into a single line.
Script

Apache Pig - Not able to read the bag

一曲冷凌霜 posted on 2019-12-11 01:26:13
Question: I am trying to read comma-separated data using Pig as below:
    grunt> cat script/pig/emp_tuple1.txt
    1,kirti,250000,{(100),(200)}
    2,kk,240000,{(100),(300)}
    3,kumar,200000,{(200),(400)}
    4,shinde,290000,{(200),(500),(300),(100)}
    5,shinde k y,260000,{(100),(300),(200)}
    6,amol,255000,{(300)}
    grunt> emp_t1 = load 'script/pig/emp_tuple1.txt' using PigStorage(',') as (empno:int, ename:chararray, salary:int, dlist:bag{});
    grunt> dump emp_t1;
    2015-11-23 12:26:44,450 [main] INFO org.apache.pig.backend

How do I split in Pig a tuple of many maps into different rows

喜夏-厌秋 posted on 2019-12-10 23:41:40
Question: I have a relation in Pig that looks like this:
    ([account_id#100, timestamp#1434, id#900], [account_id#100, timestamp#1434, id#901], [account_id#100, timestamp#1434, id#902])
As you can see, I have three map objects within a tuple. All of the data above is in the $0 field of the relation, so it is a relation with a single bytearray column. The data is loaded as follows:
    data = load 's3://data/data' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
    DESCRIBE

Selecting fields after grouping in Pig

天涯浪子 posted on 2019-12-10 20:48:52
Question: There's probably something very trivial that I'm missing, but I just can't get this to work. I have a "movies" object with title, actor, year and role. What I want is results with the title along with a nested bag containing actor/role pairs. If I just do group movies by title, I end up with results like (title, {movie objects}), which would be perfect except that the title and year also appear in the movie objects there. I want just the actor and role. I also tried foreach

Json parse with elephantbird in Pig

余生颓废 posted on 2019-12-10 20:30:10
Question: I can't get the following data to parse in Pig. It's what the Twitter API returns after getting all tweets from a certain user. Source data (I removed some numbers so as not to invade anyone's privacy by accident):
    [{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"@Beace_ your nan makes me laugh with some of the things she comes out with","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a

Pig: How to join data on a key in a nested bag

╄→尐↘猪︶ㄣ posted on 2019-12-10 19:56:08
Question: I'm simply trying to merge in the values from data2 to data1 on the 'value1'/'value2' keys seen in both data1 and data2 (note the nested structure of Easy right? In object-oriented code it's a nested for loop. But in Pig it feels like solving a Rubik's cube.
    data1 = 'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
    data2 = 'value1' 'result1'
            'value2' 'result2'
    A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
    B = load 'data7'