apache-pig | 易学教程

Load only a few values from complex JSON object in Pig Latin

Question: I have a complex JSON file that looks like this: http://pastebin.com/4UfadbqS. I would like to load only several values from these JSON objects using Pig Latin. I tried doing it like this:

    mydata = LOAD 'data.json' USING JsonLoader('id:chararray, created_at:chararray, user: {(language:chararray)}');
    STORE mydata INTO 'output';

But it seems that Pig Latin just takes the first three values from the JSON and saves them (it does not recognize the column name as a key). Is there a way to achieve

Pig: Get index in nested foreach

Question: I have a Pig script with code like:

    scores = LOAD 'file' as (id:chararray, scoreid:chararray, score:int);
    scoresGrouped = GROUP scores by id;
    top10s = foreach scoresGrouped {
        sorted = order scores by score DESC;
        sorted10 = LIMIT sorted 10;
        GENERATE group as id, sorted10.scoreid as top10candidates;
    };

It gets me a bag like

    id1, {(scoreidA),(scoreidB),(scoreIdC)..(scoreIdFoo)}

However, I wish to include the index of each item as well, so I'd have results like

    id1, {(scoreidA,1),(scoreidB,2),
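The result the question asks for (each group's top entries paired with a 1-based position) can be modelled outside Pig. The sketch below is Python, not Pig Latin, and is not the asker's or any answerer's solution; the sample rows and the function name `top_n_with_index` are hypothetical. It only illustrates the semantics: group by id, sort by score descending, limit, then enumerate.

```python
from collections import defaultdict


def top_n_with_index(rows, n=10):
    """Group (id, scoreid, score) rows by id, keep the n highest scores
    per id, and pair each scoreid with its 1-based position."""
    groups = defaultdict(list)
    for id_, scoreid, score in rows:
        groups[id_].append((scoreid, score))

    result = {}
    for id_, items in groups.items():
        # Same steps as the nested FOREACH: ORDER ... DESC, then LIMIT n.
        ranked = sorted(items, key=lambda item: item[1], reverse=True)[:n]
        # enumerate(..., start=1) supplies the index the question wants.
        result[id_] = [
            (scoreid, index)
            for index, (scoreid, _score) in enumerate(ranked, start=1)
        ]
    return result


# Hypothetical sample data, not taken from the question.
rows = [
    ("id1", "scoreidB", 70),
    ("id1", "scoreidA", 90),
    ("id1", "scoreIdC", 50),
    ("id2", "scoreidX", 10),
]
print(top_n_with_index(rows, n=2))
```

The index is assigned only after sorting and limiting, so it reflects the position within each group rather than the position in the input.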

matrix multiplication apache pig

Question: I am trying to perform matrix multiplication in Pig Latin. Here's my attempt so far:

    matrix1 = LOAD 'mat1' AS (row,col,value);
    matrix2 = LOAD 'mat2' AS (row,col,value);
    mult_mat = COGROUP matrix1 BY row, matrix2 BY col;
    mult_mat = FOREACH mult_mat {
        A = COGROUP matrix1 BY col, matrix2 BY row;
        B = FOREACH A GENERATE group AS col, matrix1.value*matrix2.value AS prod;
        GENERATE group AS row, B.col AS col, SUM(B.prod) AS value;
    }

However, this doesn't work. I get stopped at A = COGROUP matrix1...
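For reference, multiplying two matrices stored as (row, col, value) triples amounts to pairing entries where the first matrix's column index equals the second matrix's row index, then summing the products per output cell. The Python sketch below models that join-then-aggregate shape. It is not Pig Latin and not a fix taken from the thread; the function name `sparse_matmul` and the 2x2 sample matrices are hypothetical.

```python
from collections import defaultdict


def sparse_matmul(matrix1, matrix2):
    """Multiply matrices given as (row, col, value) triples.

    Entries are paired where matrix1's col equals matrix2's row (the join),
    and products are summed per (row, col) of the result (the group + sum).
    """
    # Index matrix2 by its row so each matrix1 entry can find its partners.
    by_row = defaultdict(list)
    for row, col, value in matrix2:
        by_row[row].append((col, value))

    product = defaultdict(int)
    for row, inner, left in matrix1:
        for col, right in by_row.get(inner, []):
            product[(row, col)] += left * right

    return sorted((row, col, value) for (row, col), value in product.items())


# Hypothetical inputs: [[1, 2], [3, 4]] times [[5, 6], [7, 8]].
matrix1 = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
matrix2 = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]
print(sparse_matmul(matrix1, matrix2))
```

Note that the pairing key is matrix1's column and matrix2's row, whereas the first COGROUP in the question pairs matrix1's row with matrix2's column.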

How to prevent Apache pig from outputting empty files?

Question: I have a Pig script that reads data from a directory on HDFS. The data are stored as Avro files. The file structure looks like:

    DIR--
       --Subdir1
       --Subdir2
       --Subdir3
       --Subdir4

In the Pig script I am simply doing a load, filter and store. It looks like:

    items = LOAD path USING AvroStorage()
    items = FILTER items BY some property
    STORE items into outputDirectory using AvroStorage()

The problem right now is that Pig is outputting many empty files in the output directory. I am wondering if there's a

Pig - How to manipulate and compare dates?

Question: I have a file which contains entries like this:

    1,1,07 2012,07 2013,11,blablabla

The first two fields are ids. The third is the begin date (month year) and the fourth is the end date. The fifth field is the number of months between these two dates. The last field contains text. Here is my Pig code to load this data:

    f = LOAD 'file.txt' USING PigStorage(',') AS (id1:int, id2:int, date1:chararray, date2:chararray, duration:int, text:chararray);

I would like to filter my file so that I keep

How to extract keys from map?

Question: How do I extract all keys from a map field? I have a bag of tuples where one of the fields is a map that contains HTTP headers (and their values). I want to create a set of all possible keys (in my dataset) for an HTTP header and count how many times I've seen them. Ideally, something like:

    A = LOAD ...
    B = FOREACH A GENERATE KEYS(http_headers)
    C = GROUP FLATTEN(B) BY $0
    D = FOREACH C GENERATE group, COUNT($0)

(I didn't test it, but it illustrates the idea.) How do I do something like this? If I

Pig: efficient filtering by loaded list

In Apache Pig (version 0.16.x), what are some of the most efficient methods to filter a dataset by an existing list of values for one of the dataset's fields? For example (updated per @inquisitive_mind's tip):

Input: a line-separated file with one value per line

my_codes.txt

    '110'
    '100'
    '000'

sample_data.txt

    '110', 2
    '110', 3
    '001', 3
    '000', 1

Desired output

    '110', 2
    '110', 3
    '000', 1

Sample script

    %default my_codes_file 'my_codes.txt'
    %default sample_data_file 'sample_data.txt'
    my_codes = LOAD '$my_codes_file' as (code:chararray)
    sample_data = LOAD '$sample_data_file' as (code: chararray,
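The filtering described in the question is a semi-join: keep each data row whose code appears in the small list. The Python sketch below models that behaviour using the sample values from the question. It is not Pig Latin; the function name `filter_by_codes` is hypothetical, and holding the list in an in-memory set is an assumption that only suits a list small enough to fit in memory (roughly the idea behind a map-side, or replicated, join against a small relation).

```python
def filter_by_codes(records, codes):
    """Keep the records whose first field is one of the allowed codes."""
    allowed = set(codes)  # set membership tests are O(1) on average
    return [record for record in records if record[0] in allowed]


# Values copied from the question's my_codes.txt and sample_data.txt.
my_codes = ["110", "100", "000"]
sample_data = [("110", 2), ("110", 3), ("001", 3), ("000", 1)]

for code, count in filter_by_codes(sample_data, my_codes):
    print(code, count)
```

Input order is preserved, and a code in the list with no matching rows (here `100`) simply contributes nothing to the output.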

Expand an array with Apache Pig

Question: I'm analyzing data with Apache Pig and could not find a way to expand an array of items. Here is the schema I'm working with, and an example of the desired output:

    (col1:int, col2:int, items:{ARRAY_ELEM:(name:chararray, total:int)})

    input  = (1, 1, {("bird", 5), ("bear", 12), ("wolf", 10)})
    output = (1, 1, "bird", 5, "bear", 12, "wolf", 10)

Is there any way to do this transformation? Thanks for your help!

Answer 1: If you need to do this transformation right now the easiest way is probably to do a
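The transformation requested in the question can be stated precisely with a small model. The sketch below is Python rather than Pig Latin and is not the approach from the truncated answer; the function name `expand_items` is hypothetical. It represents the bag as a list so the element order is fixed, which is an assumption, since a Pig bag carries no guaranteed order.

```python
def expand_items(record):
    """Flatten (col1, col2, items) into one tuple, appending each item's
    (name, total) fields in order after the two leading columns."""
    col1, col2, items = record
    flat = [field for item in items for field in item]
    return (col1, col2, *flat)


# Input taken from the question, with the bag modelled as a list.
record = (1, 1, [("bird", 5), ("bear", 12), ("wolf", 10)])
print(expand_items(record))
```

The width of the result depends on how many items each record holds, so different records can produce tuples of different lengths.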

Generate multiple outputs with Hadoop Pig

Question: I've got a file containing a list of data in Hadoop. I've built a simple Pig script which analyzes the file by id number, and so on. The last step I'm looking for is this: I'd like to create (store) a file for each unique id number. So this should depend on a group step. However, I haven't understood whether this is possible (maybe there is a custom store module?). Any idea? Thanks, Daniele

Answer 1: While keeping in mind what is said by frail, MultiStorage, in PiggyBank, seems to be what

Load xlsx file into Pig

Question: Is there any way to load .xlsx files into Pig? I need to perform an operation in Pig using an Excel file (.xlsx) as input, but I couldn't find any built-in functions for this purpose. Any help would be appreciated. Thanks.

Answer 1: Try this. First convert the xlsx file to CSV, then do the following:

    REGISTER Location\to\piggybank.jar
    Data = load 'Location\to\csv\file' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT