apache-pig

Load only a few values from complex JSON object in Pig Latin

牧云@^-^@ posted on 2019-12-08 06:40:58
Question: I have a complex JSON file that looks like this: http://pastebin.com/4UfadbqS
I would like to load only a few values from these JSON objects using Pig Latin. I tried this:
    mydata = LOAD 'data.json' USING JsonLoader('id:chararray, created_at:chararray, user: {(language:chararray)}');
    STORE mydata INTO 'output';
But it seems that Pig Latin just takes the first 3 values from the JSON and saves them (it does not recognize the column name as a key). Is there a way to achieve

Pig: Get index in nested foreach

对着背影说爱祢 posted on 2019-12-08 06:17:24
Question: I have a Pig script with code like:
    scores = LOAD 'file' as (id:chararray, scoreid:chararray, score:int);
    scoresGrouped = GROUP scores by id;
    top10s = foreach scoresGrouped {
        sorted = order scores by score DESC;
        sorted10 = LIMIT sorted 10;
        GENERATE group as id, sorted10.scoreid as top10candidates;
    };
It gets me a bag like
    id1, {(scoreidA),(scoreidB),(scoreIdC)..(scoreIdFoo)}
However, I wish to include the index of the items as well, so I'd have results like id1, {(scoreidA,1),(scoreidB,2),

matrix multiplication apache pig

梦想与她 posted on 2019-12-08 06:13:55
Question: I am trying to perform matrix multiplication in Pig Latin. Here's my attempt so far:
    matrix1 = LOAD 'mat1' AS (row,col,value);
    matrix2 = LOAD 'mat2' AS (row,col,value);
    mult_mat = COGROUP matrix1 BY row, matrix2 BY col;
    mult_mat = FOREACH mult_mat {
        A = COGROUP matrix1 BY col, matrix2 BY row;
        B = FOREACH A GENERATE group AS col, matrix1.value*matrix2.value AS prod;
        GENERATE group AS row, B.col AS col, SUM(B.prod) AS value;
    }
However, this doesn't work. I get stopped at A = COGROUP matrix1...
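
The attempt above most likely stops at the inner COGROUP because a nested FOREACH block accepts only a small set of operators (such as DISTINCT, FILTER, LIMIT, ORDER BY, CROSS and FOREACH), and COGROUP is not one of them. A minimal sketch of an alternative, not taken from the original thread: join the two matrices on the shared inner index, multiply, then sum per output cell. The field types below are assumptions; the question loads the fields untyped.

```pig
-- Assumed types: the original LOAD statements declare no types.
matrix1 = LOAD 'mat1' AS (row:int, col:int, value:double);
matrix2 = LOAD 'mat2' AS (row:int, col:int, value:double);

-- Pair every A[i,k] with every B[k,j] by joining on the inner index k.
joined = JOIN matrix1 BY col, matrix2 BY row;

-- One partial product per (i, k, j) combination.
products = FOREACH joined GENERATE
    matrix1::row AS row,
    matrix2::col AS col,
    matrix1::value * matrix2::value AS prod;

-- Sum the partial products for each output cell (i, j).
by_cell = GROUP products BY (row, col);
result = FOREACH by_cell GENERATE
    group.row AS row,
    group.col AS col,
    SUM(products.prod) AS value;
```

This treats each matrix as a sparse list of (row, col, value) triples, so cells that are absent from the input simply contribute nothing to the sum.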

How to prevent Apache pig from outputting empty files?

荒凉一梦 posted on 2019-12-08 06:13:29
Question: I have a Pig script that reads data from a directory on HDFS. The data are stored as Avro files. The file structure looks like:
    DIR--
      --Subdir1
      --Subdir2
      --Subdir3
      --Subdir4
In the Pig script I am simply doing a load, filter and store. It looks like:
    items = LOAD path USING AvroStorage()
    items = FILTER items BY some property
    STORE items into outputDirectory using AvroStorage()
The problem right now is that Pig is outputting many empty files in the output directory. I am wondering if there's a

Pig - How to manipulate and compare dates?

岁酱吖の posted on 2019-12-08 05:44:10
Question: I have a file which contains entries like this:
    1,1,07 2012,07 2013,11,blablabla
The first two fields are ids. The third is the start date (month year) and the fourth is the end date. The fifth field is the number of months between these two dates. The last field contains text. Here is my Pig code to load this data:
    f = LOAD 'file.txt' USING PigStorage(',') AS (id1:int, id2:int, date1:chararray, date2:chararray, duration:int, text:chararray);
I would like to filter my file so that I keep

How to extract keys from map?

三世轮回 posted on 2019-12-08 05:34:23
Question: How do I extract all keys from a map field? I have a bag of tuples where one of the fields is a map that contains HTTP headers (and their values). I want to create a set of all possible keys (in my dataset) for an HTTP header and count how many times I've seen them. Ideally, something like:
    A = LOAD ...
    B = FOREACH A GENERATE KEYS(http_headers)
    C = GROUP FLATTEN(B) BY $0
    D = FOREACH C GENERATE group, COUNT($0)
(I didn't test it, but it illustrates the idea.) How do I do something like this? If I
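
One way to approach the idea sketched in the question, assuming a Pig version that ships the built-in KEYSET function (which returns the keys of a map as a bag): flatten the key bag so that each key becomes its own row, then group and count. The input path and the single-field schema are hypothetical.

```pig
-- Hypothetical input: each record carries one map field with the HTTP headers.
A = LOAD 'requests' AS (http_headers:map[]);

-- KEYSET returns a bag of keys; FLATTEN turns it into one row per key.
B = FOREACH A GENERATE FLATTEN(KEYSET(http_headers)) AS header;

-- Count how many records contained each header name.
C = GROUP B BY header;
D = FOREACH C GENERATE group AS header, COUNT(B) AS occurrences;

DUMP D;
```

Because the keys of a map are unique within a record, the count here is the number of records in which each header appears.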

Pig: efficient filtering by loaded list

眉间皱痕 posted on 2019-12-08 05:12:36
In Apache Pig (version 0.16.x), what are some of the most efficient methods to filter a dataset by an existing list of values for one of the dataset's fields? For example (updated per @inquisitive_mind's tip):
Input: a line-separated file with one value per line
my_codes.txt
    '110'
    '100'
    '000'
sample_data.txt
    '110', 2
    '110', 3
    '001', 3
    '000', 1
Desired output
    '110', 2
    '110', 3
    '000', 1
Sample script
    %default my_codes_file 'my_codes.txt'
    %default sample_data_file 'sample_data.txt'
    my_codes = LOAD '$my_codes_file' as (code:chararray);
    sample_data = LOAD '$sample_data_file' as (code: chararray,
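
One commonly used option when the list of codes is small enough to fit in memory is a fragment-replicate join, which keeps the small relation in memory on each map task and avoids a reduce phase. The sketch below is not from the original post: the comma delimiter, the second field's name (`value`) and its type are assumptions, since the question's own LOAD statement is cut off.

```pig
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'

my_codes = LOAD '$my_codes_file' AS (code:chararray);

-- Assumed delimiter and second field; the int cast assumes no stray
-- whitespace around the number in the data file.
sample_data = LOAD '$sample_data_file' USING PigStorage(',')
    AS (code:chararray, value:int);

-- Remove duplicate codes so the join cannot multiply matching rows.
unique_codes = DISTINCT my_codes;

-- In a replicated join the small relation is listed last.
joined = JOIN sample_data BY code, unique_codes BY code USING 'replicated';

-- Keep only the columns of the original dataset.
filtered = FOREACH joined GENERATE
    sample_data::code AS code,
    sample_data::value AS value;
```

If the code list is too large for memory, dropping `USING 'replicated'` falls back to a regular join with the same result.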

Expand an array with Apache Pig

狂风中的少年 posted on 2019-12-08 04:11:08
Question: I'm analyzing data with Apache Pig and could not find a way to expand an array of items. Here is the schema I'm working with, and an example of the desired output:
    (col1:int, col2:int, items:{ARRAY_ELEM:(name:chararray, total:int)})
    input = (1, 1, {("bird", 5), ("bear", 12), ("wolf", 10)})
    output = (1, 1, "bird", 5, "bear", 12, "wolf", 10)
Is there any way to do this transformation? Thanks for your help!
Answer 1: If you need to do this transformation right now the easiest way is probably to do a

Generate multiple outputs with Hadoop Pig

橙三吉。 posted on 2019-12-08 03:48:56
Question: I've got a file containing a list of data in Hadoop. I've built a simple Pig script which analyzes the file by the id number, and so on. The last step I'm looking for is this: I'd like to create (store) a file for each unique id number. So this should depend on a group step. However, I haven't understood whether this is possible (maybe there is a custom store module?). Any idea? Thanks, Daniele
Answer 1: While keeping in mind what is said by frail, MultiStorage, in PiggyBank, seems to be what
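
A minimal sketch of how MultiStorage from PiggyBank is typically invoked, for illustration only. The jar path, input path and schema are hypothetical; the two constructor arguments are the parent output path and the zero-based index of the field to split on.

```pig
-- Hypothetical jar location; adjust to your installation.
REGISTER /path/to/piggybank.jar;

-- Hypothetical input and schema: the id is the first field (index 0).
records = LOAD 'input_data' USING PigStorage('\t')
    AS (id:chararray, payload:chararray);

-- Writes the records into one subdirectory of 'output' per distinct
-- value of field 0.
STORE records INTO 'output'
    USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');
```

Each distinct id ends up in its own subdirectory under the output path, so no explicit GROUP step is needed just to split the data.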

Load xlsx file into Pig

心已入冬 posted on 2019-12-08 03:18:55
Question: Is there any way to load .xlsx files into Pig? I need to perform an operation in Pig using an Excel file (.xlsx) as input, but I couldn't find any built-in functions available for this purpose. Any help with this would be appreciated. Thanks,
Answer 1: Try this: first convert the xlsx file into CSV, then do the following:
    REGISTER Location\to\piggybank.jar
    Data = load 'Location\to\csv\file' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT