apache-pig

Pig UDF that accepts multiple inputs

Submitted by 江枫思渺然 on 2019-12-08 12:18:42
Question: A quick question on Pig UDFs. I have a custom UDF that I want to accept multiple columns:

package pigfuncs;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.FuncSpec;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class DataToXML extends
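Not part of the original question, but for context: an EvalFunc-based UDF receives whatever columns it is called with packed into the single Tuple passed to exec(), as input.get(0), input.get(1), and so on. A minimal sketch of the Pig Latin side, with a hypothetical jar name, input file, and schema:

REGISTER 'pigfuncs.jar';
DEFINE DataToXML pigfuncs.DataToXML();

-- every column listed in the call below arrives inside the UDF's exec(Tuple input)
data = LOAD 'data.txt' USING PigStorage(',') AS (id:int, name:chararray, price:int);
xml  = FOREACH data GENERATE DataToXML(id, name, price);
DUMP xml;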

Pass a relation to a PIG UDF when using FOREACH on another relation?

Submitted by 最后都变了- on 2019-12-08 12:18:13
Question: We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map those ids using another file that contains two columns of mappings, like the following (column 1 is our data, column 2 is a third party's data):

35 6009
521 21599
225 51991
12 6129

We wrote a UDF that takes in the column value (so: "35 521 225") and the mappings from the file. We would then split the column value and iterate over each and return the first
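Not from the original post, but one common alternative to passing the mapping relation into a UDF is to explode the space-separated ids with TOKENIZE and join against the mapping file. A rough sketch with assumed file names and field names (note it keeps every match, not just the first):

data    = LOAD 'data.txt' AS (row_id:int, ids:chararray);          -- ids like '35 521 225'
mapping = LOAD 'mapping.txt' AS (our_id:chararray, their_id:chararray);

-- TOKENIZE splits on whitespace and yields a bag of single-field tuples
exploded = FOREACH data GENERATE row_id, FLATTEN(TOKENIZE(ids)) AS one_id;
joined   = JOIN exploded BY one_id, mapping BY our_id;
result   = FOREACH joined GENERATE exploded::row_id AS row_id, mapping::their_id AS their_id;
DUMP result;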

Time differences in Apache Pig?

Submitted by 你说的曾经没有我的故事 on 2019-12-08 11:44:27
Question: In a Big Data context I have a time series S1 = (t1, t2, t3, ...) sorted in ascending order. I would like to produce the series of time differences S2 = (t2-t1, t3-t2, ...). Is there a way to do this in Apache Pig? Short of a very inefficient self-join, I do not see one. If not, what would be a good way to do this that is suitable for large amounts of data?

Answer 1:
S1 = Generate Id,Timestamp i.e. from t1...tn
S2 = Generate Id,Timestamp i.e. from t2...tn
S3 = Join S1 by Id, S2 by Id
S4 = Extract S1.Timestamp
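The answer above is pseudocode; a hedged sketch of one way to realize it, assuming Pig 0.11+ (for RANK) and a relation S1 that holds the timestamps, already in ascending order, in a single numeric field t:

ranked = RANK S1;                                   -- adds a field rank_S1 = 1..n in input order
A = FOREACH ranked GENERATE rank_S1 AS id, t;       -- row k
B = FOREACH ranked GENERATE rank_S1 - 1 AS id, t;   -- row k+1, shifted to line up with row k
J = JOIN A BY id, B BY id;                          -- pairs (t_k, t_{k+1})
S2 = FOREACH J GENERATE B::t - A::t AS delta;       -- t_{k+1} - t_k
DUMP S2;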

hadoop pig bag subtraction

Submitted by 烈酒焚心 on 2019-12-08 11:35:01
Question: I'm using Pig to parse my application logs to find out which exposed methods have been called by a user this month that were not called by the same user last month. I have managed to get the methods called, grouped by user, before last month and after last month:

BEFORE last month relation sample
u1 {(m1),(m2)}
u2 {(m3),(m4)}

AFTER last month relation sample
u1 {(m1),(m3)}
u2 {(m1),(m4)}

What I want is to find, per user, which methods are in AFTER but not in BEFORE, that is, the NEWLY_CALLED expected result
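A hedged sketch of an anti-join approach, assuming relations named before and after with the schema (user:chararray, methods:{(m:chararray)}); all names here are assumptions:

before_flat = FOREACH before GENERATE user, FLATTEN(methods) AS method;
after_flat  = FOREACH after  GENERATE user, FLATTEN(methods) AS method;

-- left outer join, then keep the AFTER rows that found no BEFORE match
j = JOIN after_flat BY (user, method) LEFT OUTER, before_flat BY (user, method);
new_only = FILTER j BY before_flat::method IS NULL;
newly_called = FOREACH new_only GENERATE after_flat::user AS user, after_flat::method AS method;
result = GROUP newly_called BY user;
DUMP result;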

avoiding prefixes in multi relation join in pig

Submitted by 孤街浪徒 on 2019-12-08 11:32:28
I am trying to do a star-schema type of join in Pig, and below is my code. When I join multiple relations with different columns, I have to prefix each column with the name of the previous join every time to get it working. I am sure there must be a better way, but I have not been able to find it through googling. Any pointers would be very helpful. That is, prefixing a column like "H864::H86::hs_8_d::hs_8_desc" is what I want to avoid.

hs_8 = LOAD 'hs_8_distinct' USING PigStorage('^') as (hs_8:chararray, hs_8_desc:chararray);
hs_8_d = FOREACH hs_8 GENERATE SUBSTRING(hs_8,0,2) as hs_2, SUBSTRING(hs_8,0,4) as hs_4
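One common workaround (not from the original post) is to re-project with plain names right after each join, so the next join no longer needs the :: prefixes. A generic sketch with hypothetical relations a, b, and c:

j1  = JOIN a BY k, b BY k;
-- flatten the :: names away before the next join
j1f = FOREACH j1 GENERATE a::k AS k, a::v AS a_v, b::v AS b_v;

j2  = JOIN j1f BY k, c BY k;
-- only k is ambiguous here; a_v and b_v need no prefix
out = FOREACH j2 GENERATE j1f::k AS k, a_v, b_v, c::v AS c_v;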

Parsing 'Complex' JSON with Pig

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-08 11:20:44
Question: Say I have some moderately complex JSON like

{
  "revenue": 100,
  "products": [
    {"name": "Apple", "price": 50},
    {"name": "Banana", "price": 50}
  ]
}

Obviously this is a bit contrived, but what's the best way to map this to Pig using JsonLoader? I've tried

a = LOAD 'test.json' USING JsonLoader('revenue:int,products:[(name:chararray,price:int)]');

and

a = LOAD 'test.json' USING JsonLoader('revenue:int,products:[{(name:chararray,price:int)]}');

However, when I DUMP a, I get (100,) for both. I've
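Not from the original post, but Pig's schema syntax writes a bag of tuples as {()} rather than [()] (square brackets denote maps), so something like the following may work with the built-in JsonLoader; elephant-bird's JsonLoader is a common fallback when the built-in one proves too strict. Field names follow the JSON above:

a = LOAD 'test.json' USING JsonLoader('revenue:int, products:{(name:chararray, price:int)}');
b = FOREACH a GENERATE revenue, FLATTEN(products) AS (name:chararray, price:int);
DUMP b;
-- expected shape: (100,Apple,50) and (100,Banana,50)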

Pig: efficient filtering by loaded list

Submitted by 旧城冷巷雨未停 on 2019-12-08 08:17:31
Question: In Apache Pig (version 0.16.x), what are some of the most efficient methods to filter a dataset by an existing list of values for one of the dataset's fields?

For example (updated per @inquisitive_mind's tip):

Input: a line-separated file with one value per line

my_codes.txt
'110'
'100'
'000'

sample_data.txt
'110', 2
'110', 3
'001', 3
'000', 1

Desired output
'110', 2
'110', 3
'000', 1

Sample script
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'
my_codes =
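Not part of the original question, but for a small lookup list a fragment-replicate join is the usual efficient filter. A rough sketch continuing the script above (field names are assumptions, and the quotes in the sample files would need stripping or matching):

my_codes = LOAD '$my_codes_file' AS (code:chararray);
sample   = LOAD '$sample_data_file' USING PigStorage(',') AS (code:chararray, cnt:int);

-- 'replicated' ships the small code list to every mapper, avoiding a reduce-side join
joined   = JOIN sample BY code, my_codes BY code USING 'replicated';
filtered = FOREACH joined GENERATE sample::code AS code, sample::cnt AS cnt;
DUMP filtered;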

Pig - How to manipulate and compare dates?

Submitted by 余生颓废 on 2019-12-08 08:14:28
I have a file that contains entries like this:

1,1,07 2012,07 2013,11,blablabla

The first two fields are ids. The third is the begin date (month year) and the fourth is the end date. The fifth field is the number of months between these two dates, and the last field contains text. Here is my Pig code to load this data:

f = LOAD 'file.txt' USING PigStorage(',') AS (id1:int, id2:int, date1:chararray, date2:chararray, duration:int, text:chararray);

I would like to filter my file so that I keep only the entries where date2 is less than three years from today. Is it possible to do that in Pig? Thanks
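One possible approach (a sketch, not a verified answer): parse date2 with ToDate using the 'MM yyyy' pattern and compare against the current time with YearsBetween, both built-ins in Pig 0.11+:

f = LOAD 'file.txt' USING PigStorage(',') AS (id1:int, id2:int, date1:chararray, date2:chararray, duration:int, text:chararray);
g = FOREACH f GENERATE id1, id2, ToDate(date1, 'MM yyyy') AS d1, ToDate(date2, 'MM yyyy') AS d2, duration, text;

-- keep rows whose end date falls less than three years before today
recent = FILTER g BY YearsBetween(CurrentTime(), d2) < 3;
DUMP recent;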

comparing datetime in pig

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-08 07:46:56
Question: In Pig 0.11, is there support for comparing datetime types? For example, with date1:datetime and a filter with the condition date1 >= ToDate('1999-01-01'), does this comparison return the correct result?

Answer 1: Date comparison can be treated as a numerical comparison. E.g.:

cat date1.txt
1999-01-01
2011-03-19
2011-02-24
2011-02-25
2011-05-23
1978-12-13

A = load 'date1.txt' as (in:chararray);
B = foreach A generate ToDate(in, 'yyyy-MM-dd') as (dt:datetime);
--filter dates that are equal or greater than 2011
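The answer is cut off above; a minimal sketch (not from the original answer) of how such a filter could be finished, using ToUnixTime so the comparison is explicitly numeric, in line with the answer's suggestion:

C = FILTER B BY ToUnixTime(dt) >= ToUnixTime(ToDate('2011-01-01', 'yyyy-MM-dd'));
DUMP C;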

Too many filter matching in pig

Submitted by 孤者浪人 on 2019-12-08 07:20:40
Question: I have a list of filter keywords (about 1000 of them) and I need to filter a field of a relation in Pig using this list. Initially, I declared these keywords like:

%declare p1 '.*keyword1.*';
...
%declare p1000 '.*keyword1000.*';

I am then doing the filtering like:

Filtered = FILTER SRC BY (not $0 matches '$p1') and (not $0 matches '$p2') and ... (not $0 matches '$p1000');
DUMP Filtered;

Assume that my source relation is in SRC and I need to apply the filtering on the first field, i.e. $0.
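Not from the original post, but one common way to avoid a thousand separate matches clauses is to collapse the keywords into a single alternation pattern (the keyword names below are placeholders):

-- extend the alternation with the remaining keywords
%declare BAD_WORDS 'keyword1|keyword2|keyword3';
Filtered = FILTER SRC BY NOT ($0 MATCHES '.*($BAD_WORDS).*');
DUMP Filtered;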