apache-pig | 易学教程

Pig Changing Schema to required type

Question: I'm a new Pig user. I have an existing schema that I want to modify. My source data has 6 columns:

Name  Type  Date      Region  Op  Value
-----------------------------------------
john  ab    20130106  D       X   20
john  ab    20130106  D       C   19
jphn  ab    20130106  D       T   8
jphn  ab    20130106  E       C   854
jphn  ab    20130106  E       T   67
jphn  ab    20130106  E       X   98

and so on. Each Op value is always C, T or X. I want to split my data into 7 columns as follows:

Name  Type  Date  Region  OpX  OpC  OpT
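The reshaping asked for in this question is a pivot: group rows by (Name, Type, Date, Region) and spread the Op values into three columns. The following is a minimal sketch of that logic in plain Python rather than Pig Latin, meant only to make the target shape concrete. The function name `pivot_ops`, the use of `None` for a missing Op, and the "last value wins" rule for repeated Ops are all assumptions, not part of the original question.

```python
from collections import defaultdict

# Sample rows copied from the question: (Name, Type, Date, Region, Op, Value)
ROWS = [
    ("john", "ab", "20130106", "D", "X", 20),
    ("john", "ab", "20130106", "D", "C", 19),
    ("jphn", "ab", "20130106", "D", "T", 8),
    ("jphn", "ab", "20130106", "E", "C", 854),
    ("jphn", "ab", "20130106", "E", "T", 67),
    ("jphn", "ab", "20130106", "E", "X", 98),
]


def pivot_ops(rows):
    """Group by (Name, Type, Date, Region) and spread Op into OpX, OpC, OpT.

    Assumptions: a missing Op becomes None, and if an Op repeats within
    a group the last value seen wins.
    """
    grouped = defaultdict(lambda: {"X": None, "C": None, "T": None})
    for name, type_, date, region, op, value in rows:
        grouped[(name, type_, date, region)][op] = value
    # Emit one 7-column row per group, in first-seen group order.
    return [key + (ops["X"], ops["C"], ops["T"]) for key, ops in grouped.items()]


if __name__ == "__main__":
    for row in pivot_ops(ROWS):
        print(row)
```

In Pig itself the same shape would typically come from a GROUP followed by a nested FOREACH that filters on each Op value, but that script is not shown in the excerpt.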

Error 1121 importing external library in Pig UDF in Jython

Question: I'm having a problem using the Python library simplejson in Jython to write a Pig UDF. I need it because jython-standalone-2.5.2.jar doesn't come with a JSON library. I'm using Apache Pig version 0.11.0-cdh4.4.0 (rexported) compiled Sep 03 2013, 20:25:46, and according to the documentation http://pig.apache.org/docs/r0.11.1/udf.html#python-advanced "You can import Python modules in your Python script. Pig resolves Python dependencies recursively, which means Pig will automatically ship all

ERROR [main] 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException

Question: I created the script below in Pig. I am pretty new to Pig and Pig Latin, and I am still learning how to use Pig scripts efficiently. Upon executing the script I got this error: ERROR [main] org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException. Can somebody please explain the reason and how I can correct it? In the CSV file, all columns are character columns except the rate column, which has integer values.

With Apache Pig how to select and store columns from a CSV according to header line

Question: I have many CSV files, all with a header line. The files all look similar:

name, gender, preference, ....
peter, m, soap, ...
paul, m, gel, ...
mary, f, soap, ...
. . .

But column positions and exact header names can differ a bit, e.g. another file could look like:

"the preferences", "the name", "the gender",....
soap, peter, m, ...
gel, paul, m, ...
soap, mary, f, ...
. . .

I want to output/store only the columns for which the header contains the word "name". The position of this

Apache Pig, Suppress “Output Location Validation Failed” “Output directory … already exists”

Question: After getting help from orangeoctopus with this question, I now need to suppress the message "Output Location Validation Failed" "Output directory ... already exists". I know the directory exists; I want it that way. I am pretty sure this will be a matter of overriding something in my Storage UDF, but I am having trouble figuring out what. I am totally new to Java, so bear with me. Thanks in advance.

Answer 1: As far as I know, you cannot reuse a direct output directory. Hadoop prevents it. If I

Check if an element is present in a bag?

Question: How can I check in Pig Latin whether a bag contains an element? Example: in a bag of chararray, how can I check if a token is present?

Answer 1: In Apache Pig you can use statements nested in FOREACH; see Pig Basics. Here is an example from the documentation, where A is a bag in B:

    X = FOREACH B {
        S = FILTER A BY 'xyz';
        GENERATE COUNT(S.$0);
    }

Instead of COUNT you can use IsEmpty and the ?: operator:

    X = FOREACH B {
        S = FILTER A BY 'xyz';
        GENERATE (IsEmpty(S.$0) ? 'xyz NOT PRESENT' : 'xyz PRESENT') AS present, B;
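The filter-then-test-for-emptiness idea in this answer can be sketched in plain Python to show what each row of the result would carry. This is only an illustration of the logic, not Pig code. It assumes each row is a (key, bag) pair, that a bag is a list of one-field tuples, and that the elided filter condition compares the first field with the token. The name `flag_presence` is hypothetical.

```python
def flag_presence(relation, token):
    """For each (key, bag) row, report whether the bag holds the token.

    Mirrors the nested FOREACH: filter the bag, then test the filtered
    bag for emptiness. Assumes the match is on each tuple's first field.
    """
    result = []
    for key, bag in relation:
        matches = [t for t in bag if t[0] == token]  # plays the role of FILTER
        label = f"{token} PRESENT" if matches else f"{token} NOT PRESENT"
        result.append((key, label))
    return result


if __name__ == "__main__":
    rows = [
        ("k1", [("abc",), ("xyz",)]),
        ("k2", [("abc",)]),
    ]
    for row in flag_presence(rows, "xyz"):
        print(row)
```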

Generating binary variables in Pig\R

Question: I am working on a design for generating dummy (binary) variables in a Pig script or an R script.

Problem: the input to the Pig script is an arbitrary relation, say the table below:

A   B   C
a1  b1  c1
a2  b2  c2
a1  b1  c3

Suppose we have to generate binary columns based on B and C. The output should be:

A   B   C   B.b1  B.b2  C.c1  C.c2  C.c3
a1  b1  c1  1     0     1     0     0
a2  b2  c2  0     1     0     1     0
a1  b1  c3  1     0     0     0     1

I think writing a UDF would be the right approach. However, I am not sure how to define the output schema for the UDF as the column
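The difficulty raised here is that the set of output columns depends on the data, so the distinct values have to be known before any row can be encoded. A minimal two-pass sketch in plain Python (not a Pig UDF) shows the idea. The function name `add_dummies`, its parameters, and the choice to sort the distinct values for a stable column order are assumptions made for illustration.

```python
def add_dummies(header, rows, columns):
    """Append one 0/1 column per distinct value of each listed column.

    Pass 1 collects the distinct values (sorted for a stable order);
    pass 2 encodes every row against that fixed list.
    """
    index = {name: i for i, name in enumerate(header)}
    # Pass 1: distinct values per column determine the output schema.
    levels = {c: sorted({row[index[c]] for row in rows}) for c in columns}
    new_header = list(header) + [f"{c}.{v}" for c in columns for v in levels[c]]
    # Pass 2: encode each row with one flag per (column, value) pair.
    encoded = []
    for row in rows:
        flags = [1 if row[index[c]] == v else 0 for c in columns for v in levels[c]]
        encoded.append(tuple(row) + tuple(flags))
    return new_header, encoded


if __name__ == "__main__":
    header = ["A", "B", "C"]
    rows = [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a1", "b1", "c3")]
    new_header, encoded = add_dummies(header, rows, ["B", "C"])
    print(new_header)
    for row in encoded:
        print(row)
```

Because the schema comes out of the first pass, a fixed-schema UDF alone cannot produce it; the distinct values would have to be computed first and then supplied to whatever does the encoding.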

Load JSON array into Pig

Question: I have a JSON file with the following format: [ { "id": 2, "createdBy": 0, "status": 0, "utcTime": "Oct 14, 2014 4:49:47 PM", "placeName": "21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia", "longitude": 77.5983817, "latitude": 12.9832418, "createdDate": "Sep 16, 2014 2:59:03 PM", "accuracy": 5, "loginType": 1, "mobileNo": "0000005567" }, { "id": 4, "createdBy": 0, "status": 0, "utcTime": "Oct 14, 2014 4:52:48 PM", "placeName": "21/F, Cunningham Main Rd, Sampangi Rama

Pig default JsonLoader schema issue

Question: I have the data below, which needs to be parsed using Pig.

Data: { "Name": "BBQ Chicken", "Sizes": [ { "Size": "Large", "Price": 14.99 }, { "Size": "Medium", "Price": 12.99 } ], "Toppings": [ "Barbecue Sauce", "Chicken", "Cheese" ] }

I am able to define the schema for Name and Sizes, but I couldn't get Toppings working. Looking for some help here.

Script: data = LOAD '/user/hue/data/nested_json_pizza_sample_data.json' USING JsonLoader('Name:chararray, Sizes:bag{tuple(Size:chararray, Price:float)},