apache-pig

Apache Pig, ElephantBird JSON Loader

陌路散爱 submitted on 2019-12-12 02:14:56
Question: I'm trying to parse the input below (it contains two records) using the ElephantBird JSON loader:

[{"node_disk_lnum_1":36,"node_disk_xfers_in_rate_sum":136.40000000000001,"node_disk_bytes_in_rate_22": 187392.0, "node_disk_lnum_7": 13}]
[{"node_disk_lnum_1": 36, "node_disk_xfers_in_rate_sum": 105.2,"node_disk_bytes_in_rate_22": 123084.8, "node_disk_lnum_7":13}]

Here is my syntax:

register '/home/data/Desktop/elephant-bird-pig-4.1.jar';
a = LOAD '/pig/tc1.log' USING com.twitter.elephantbird.pig
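As a sketch of what a complete LOAD might look like here, assuming the ElephantBird loader class `com.twitter.elephantbird.pig.load.JsonLoader` and its `-nestedLoad` option (both from the Elephant Bird project; the extra jar names below are assumptions, as ElephantBird usually also needs its core and hadoop-compat jars on the classpath):

```pig
-- Hypothetical sketch: jar names/paths are assumptions, not verified against this setup.
register '/home/data/Desktop/elephant-bird-pig-4.1.jar';
-- Typically also required:
-- register 'elephant-bird-core-4.1.jar';
-- register 'elephant-bird-hadoop-compat-4.1.jar';

a = LOAD '/pig/tc1.log'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- Each parsed record arrives as a map; individual keys are read with the # operator.
-- Note: each input line here is a top-level JSON *array*; how the loader handles
-- that varies by Elephant Bird version, and you may need to FLATTEN $0 first.
b = FOREACH a GENERATE (int)$0#'node_disk_lnum_1' AS lnum_1,
                       (double)$0#'node_disk_xfers_in_rate_sum' AS xfers_sum;
```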

Applying TRIM() in Pig for all fields in a tuple

孤街醉人 submitted on 2019-12-12 01:55:08
Question: I am loading a CSV file with 56 fields. I want to apply the TRIM() function in Pig to all fields in the tuple. I tried:

B = FOREACH A GENERATE TRIM(*);

But it fails with the error below:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.TRIM as multiple or none of them fit. Please use an explicit cast.

Please help. Thank you.

Answer 1: To trim a tuple in Pig, you should create a UDF. Register the UDF and apply it with FOREACH
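Besides a UDF, one workaround (a sketch, assuming every field is loaded as chararray) is to call TRIM on each field explicitly. TRIM(*) fails because * expands to all 56 fields at once, and no TRIM overload accepts that many arguments:

```pig
-- Sketch with 3 fields standing in for the 56; field names are made up.
A = LOAD 'data.csv' USING PigStorage(',')
        AS (f1:chararray, f2:chararray, f3:chararray);
-- TRIM takes exactly one chararray argument, so call it per field.
B = FOREACH A GENERATE TRIM(f1) AS f1, TRIM(f2) AS f2, TRIM(f3) AS f3;
```

For 56 fields the projection list is tedious to type but easy to generate with a one-line script; the UDF approach in the answer avoids listing fields at the cost of writing and registering Java.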

Pig - RANK Operation on Groups

风流意气都作罢 submitted on 2019-12-12 01:53:57
Question: I'm new to Pig and I'm trying to perform a RANK operation within a group. My data looks like:

Name  address  Date
A     addr1    20150101
A     addr2    20150130
B     addr1    20140325
B     addr2    20140821
B     addr3    20150102

I want my output like this:

Name  address  Date      Rank
A     addr1    20150101  1
A     addr2    20150130  2
B     addr1    20140325  1
B     addr2    20140821  2
B     addr3    20150102  3

I'm using Pig 0.12.1. Is there any way to get the output in the required format with Pig built-in functions?

Answer 1: It will be a little bit difficult to solve this
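Pig 0.12's built-in RANK cannot be partitioned by group, but the piggybank library ships Over and Stitch, which can emulate a per-group row number. A sketch, assuming piggybank.jar is available and the data is already loaded with schema (name, address, date):

```pig
-- Assumes piggybank.jar and its Over/Stitch classes; verify against your piggybank version.
register 'piggybank.jar';
DEFINE Over   org.apache.pig.piggybank.evaluation.Over('int');
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch();

grpd = GROUP data BY name;
ranked = FOREACH grpd {
    sorted = ORDER data BY date;
    -- Over(..., 'row_number') numbers tuples 1..n within each group;
    -- Stitch glues that number back onto each original tuple.
    GENERATE FLATTEN(Stitch(sorted, Over(sorted, 'row_number')));
};
```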

where exactly PIG stores its relations

陌路散爱 submitted on 2019-12-12 01:25:13
Question: I am confused by the two points below.

1) Where exactly does the LOAD statement store this relation (student): on HDFS, in Pig's internal storage, or on the local machine? For example:

student = LOAD 'HDFS:/student' using PigStorage(',');

2) If I then try DUMP student; it takes almost 30-40 seconds to display the result, whereas the LOAD statement takes 1-2 seconds. If we are retrieving data from Pig's internal storage, why the delay? I would be grateful if anyone could clear up these doubts
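The short answer is that LOAD stores nothing: Pig is lazily evaluated, so LOAD only adds a step to the logical plan, and no data is read until an output operator (DUMP or STORE) triggers an actual MapReduce job. That is why LOAD returns in 1-2 seconds while DUMP takes 30-40. A sketch illustrating the distinction:

```pig
student = LOAD 'HDFS:/student' using PigStorage(',');
-- Nothing has been read yet; 'student' is only a node in the logical plan.

EXPLAIN student;   -- prints the logical/physical/MapReduce plans without running a job
DUMP student;      -- only now is a job submitted and the HDFS file actually read,
                   -- which is where the 30-40 seconds go
```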

Pig - exception on simple load

筅森魡賤 submitted on 2019-12-12 00:22:56
Question: I just started learning Pig and wanted to try something with it, so I entered the Pig console and simply typed:

a = load 'sample_data.csv';

(I have a file named sample_data.csv.) I received the following exception:

Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. name
java.lang.NoSuchFieldError: name
    at org.apache.pig.parser.QueryParserStringStream.<init>(QueryParserStringStream.java:32)
    at org.apache.pig.parser.QueryParserDriver.tokenize(QueryParserDriver.java:207)
    at org

hadoop pig : join on a condition (ex. tab1.COL1 LIKE (%tab2.col2%) )

爷,独闯天下 submitted on 2019-12-12 00:14:13
Question: How do I implement a join on a condition in Pig? SQL equivalents:

select * from tab1, tab2 where instr(t1.col1, t2.col1) > 1;
select * from tab1, tab2 where f(t1.col1) = f(t2.col1);

Thank you very much. Filippo

Answer 1: As of now, Pig supports only inner joins, outer joins, and full joins. The second join example can be implemented in Pig, but not the first one. Below is an example:

tab1 = LOAD 'file1' using PigStorage('|') as (col1:chararray,col2:chararray);
tab2 = LOAD 'file2' using PigStorage('
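For the first case (a LIKE '%...%' style condition), Pig's JOIN only supports equality, so one common if expensive workaround is CROSS followed by FILTER. A sketch, assuming both relations are small enough for a cross product and using the built-in INDEXOF for the substring test:

```pig
-- Hypothetical relations; schemas assumed as (col1:chararray, col2:chararray).
crossed = CROSS tab1, tab2;
-- Keep pairs where tab2.col1 occurs inside tab1.col1 (the LIKE '%...%' analogue).
matched = FILTER crossed BY INDEXOF(tab1::col1, tab2::col1, 0) >= 0;
```

Note that CROSS materializes |tab1| x |tab2| tuples, so this only scales to small inputs.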

Understanding map syntax

感情迁移 submitted on 2019-12-11 23:46:09
Question: I have some problems understanding how the map type should be used. Following this tutorial I created a file containing the following text:

[open#apache]
[apache#hadoop]

Then, I was able to load that file without errors:

a = load 'data/file_name.txt' as (M:map []);

Now, how can I get the list of all the "values"? I.e.:

(apache)
(hadoop)

Furthermore, I have just started to learn Pig, so every hint is going to be very helpful.

Answer 1: There is only one way to interact with a map, and that is to
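With known keys, map values are read with the # operator. For "all values regardless of key", later Pig releases ship a VALUELIST built-in; treating its availability as an assumption here, since it depends on your Pig version:

```pig
a = load 'data/file_name.txt' as (M:map[]);

-- With a known key, # extracts its value (null when the key is absent):
b = foreach a generate M#'open', M#'apache';

-- If your Pig version provides the VALUELIST built-in (check your release's docs),
-- all of a map's values come out as a single bag:
c = foreach a generate VALUELIST(M);
```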

Passing a list to Javascript UDF in Apache Pig

ぃ、小莉子 submitted on 2019-12-11 22:01:39
Question: If I have an array of stuff in Pig, like so:

datas = load './data.txt' using PigStorage('\t');
list = load './frobdata.txt' using PigStorage();

And I want to pass these on to a UDF, like so:

register './enfrobinate.js' using javascript as frob;
frobbed = foreach datas generate flatten( frob.enfrobinate( list, $0 ) );

I cannot seem to find a prototype that works for passing a list to JavaScript, and the Pig documentation is not really clear on data types for JavaScript UDFs. I am aware of cross

How to load a file with a JSON array per line in Pig Latin

霸气de小男生 submitted on 2019-12-11 21:45:12
Question: An existing script creates text files with an array of JSON objects per line, e.g.,

[{"foo":1,"bar":2},{"foo":3,"bar":4}]
[{"foo":5,"bar":6},{"foo":7,"bar":8},{"foo":9,"bar":0}]
…

I would like to load this data in Pig, exploding the arrays and processing each individual object. I have looked at using the JsonLoader in Twitter's Elephant Bird to no avail. It doesn't complain about the JSON, but I get "Successfully read 0 records" when running the following:

register '/tmp/elephant-bird/core
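One pattern that may work here is loading with the `-nestedLoad` option and flattening the per-line array into one record per object. This is a sketch: the jar paths are placeholders, and how FLATTEN interacts with a top-level JSON array depends on the Elephant Bird version, so treat it as a starting point rather than a verified fix:

```pig
-- Placeholder jar names; substitute the real paths.
register 'elephant-bird-core.jar';
register 'elephant-bird-pig.jar';
register 'elephant-bird-hadoop-compat.jar';

data = LOAD 'input.txt'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- Explode each line's array so every JSON object becomes its own record,
-- then read individual keys with the # operator.
exploded = FOREACH data GENERATE FLATTEN($0) AS obj;
vals = FOREACH exploded GENERATE (int)obj#'foo' AS foo, (int)obj#'bar' AS bar;
```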

Getting exception while trying to execute a Pig Latin Script

Deadly submitted on 2019-12-11 20:47:04
Question: I am learning Pig on my own, and while trying to explore a dataset I am encountering an exception. What is wrong in the script, and why?

movies_data = LOAD '/movies_data' using PigStorage(',') as (id:chararray,title:chararray,year:int,rating:double,duration:double);
high = FILTER movies_data by rating > 4.0;
high_rated = FOREACH high GENERATE movies_data.title,movies_data.year,movies_data.rating,movies_data.duration;
DUMP high_rated;

At the end of the MapReduce execution I am getting the below
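The error text is cut off above, but a likely culprit is visible in the script itself: inside FOREACH high, writing movies_data.title dereferences the whole movies_data relation as a scalar, which fails at runtime once the relation has more than one row (typically a "scalar has more than one row in the output" error, though the actual message isn't shown here). The fields of high should be referenced directly:

```pig
movies_data = LOAD '/movies_data' using PigStorage(',')
              as (id:chararray, title:chararray, year:int,
                  rating:double, duration:double);
high = FILTER movies_data by rating > 4.0;
-- Reference the fields of 'high' directly, not through the movies_data relation:
high_rated = FOREACH high GENERATE title, year, rating, duration;
DUMP high_rated;
```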