apache-pig

Pig Changing Schema to required type

谁说胖子不能爱 submitted on 2020-01-16 00:11:04
Question: I'm a new Pig user. I have an existing schema that I want to modify. My source data has 6 columns:

    Name  Type  Date      Region  Op  Value
    -----------------------------------------
    john  ab    20130106  D       X   20
    john  ab    20130106  D       C   19
    jphn  ab    20130106  D       T   8
    jphn  ab    20130106  E       C   854
    jphn  ab    20130106  E       T   67
    jphn  ab    20130106  E       X   98

and so on. Each Op value is always C, T or X. I basically want to split my data into 7 columns as follows:

    Name  Type  Date  Region  OpX  OpC  OpT
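The reshaping described in this question can be sketched outside Pig to make the target layout concrete. The Python below is only an illustration, not the asker's solution: it assumes rows are grouped on (Name, Type, Date, Region), that the three new columns hold the Value for each Op, and that a missing Op is left as None. The names `pivot_ops` and `OPS` are hypothetical.

```python
from collections import OrderedDict

# Sample rows copied from the question: (Name, Type, Date, Region, Op, Value)
rows = [
    ("john", "ab", "20130106", "D", "X", 20),
    ("john", "ab", "20130106", "D", "C", 19),
    ("jphn", "ab", "20130106", "D", "T", 8),
    ("jphn", "ab", "20130106", "E", "C", 854),
    ("jphn", "ab", "20130106", "E", "T", 67),
    ("jphn", "ab", "20130106", "E", "X", 98),
]

OPS = ("X", "C", "T")  # assumed output column order: OpX, OpC, OpT


def pivot_ops(rows):
    """Group on (Name, Type, Date, Region) and spread Op values into columns."""
    grouped = OrderedDict()
    for name, typ, date, region, op, value in rows:
        key = (name, typ, date, region)
        # One slot per Op; a slot stays None when that Op never appears (assumption).
        slots = grouped.setdefault(key, dict.fromkeys(OPS))
        slots[op] = value
    return [key + tuple(slots[op] for op in OPS) for key, slots in grouped.items()]


pivoted = pivot_ops(rows)
for row in pivoted:
    print(row)
```

In Pig itself the same shape would typically come from a GROUP on the four key columns followed by a nested FOREACH, but that script is not shown in the excerpt.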

Error 1121 importing external library in Pig UDF in Jython

六月ゝ 毕业季﹏ submitted on 2020-01-15 19:17:03
Question: I'm having a problem using the Python library simplejson in Jython to write a Pig UDF. I need it because jython-standalone-2.5.2.jar doesn't come with a JSON library. I'm using Apache Pig version 0.11.0-cdh4.4.0 (rexported) compiled Sep 03 2013, 20:25:46, and according to the documentation http://pig.apache.org/docs/r0.11.1/udf.html#python-advanced "You can import Python modules in your Python script. Pig resolves Python dependencies recursively, which means Pig will automatically ship all
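For context, a common way to write a UDF module that works with or without a bundled JSON library is a guarded import. The sketch below shows only that pattern; it does not explain or resolve ERROR 1121, and `get_field` is a hypothetical function, not code from the question.

```python
# Prefer the standard-library json module; fall back to simplejson on
# interpreters that lack it (the question mentions jython-standalone-2.5.2).
try:
    import json
except ImportError:
    import simplejson as json  # assumes simplejson is importable on the Jython path


def get_field(record, field):
    """Hypothetical UDF body: return one top-level field of a JSON string, or None."""
    try:
        return json.loads(record).get(field)
    except (ValueError, TypeError, AttributeError):
        # Malformed JSON, a None input, or a JSON value that is not an object.
        return None
```

When registered with Pig, a function like this would normally also carry an output schema declaration; that part is left out here so the block runs as plain Python.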

ERROR [main] 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException

时光总嘲笑我的痴心妄想 submitted on 2020-01-15 11:33:33
Question: I created the script below in Pig. I am pretty new to Pig and Pig Latin and am still learning how to use Pig scripts efficiently. Upon executing the script I got this error:

    ERROR [main] org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException

Can somebody please explain the reason and how I can correct it? In the CSV file all columns are char columns except the rate column, which has integer values.

With Apache Pig how to select and store columns from a CSV according to header line

让人想犯罪 __ submitted on 2020-01-15 09:14:11
Question: I have many CSV files, all with a header line. The files all look similar:

    name, gender, preference, ....
    peter, m, soap, ...
    paul, m, gel, ...
    mary, f, soap, ...
    . . .

But column positions and exact header names can differ slightly, e.g. another file could look like:

    "the preferences", "the name", "the gender",....
    soap, peter, m, ...
    gel, paul, m, ...
    soap, mary, f, ...
    . . .

I want to output/store only the columns for which the header contains the word "name". The position of this
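The selection rule in this question (keep every column whose header contains "name") can be sketched as a small pre-processing step. This is an illustration in plain Python, not a Pig solution; `select_columns` is a hypothetical helper, and matching is assumed to be case-insensitive.

```python
import csv
import io


def select_columns(lines, keyword="name"):
    """Keep only the columns whose header contains `keyword` (case-insensitive)."""
    # skipinitialspace lets the reader handle ', "the name"' style separators.
    reader = csv.reader(lines, skipinitialspace=True)
    header = next(reader)
    keep = [i for i, title in enumerate(header) if keyword in title.lower()]
    selected = [[header[i] for i in keep]]
    for row in reader:
        # Skip indexes that a short row does not have.
        selected.append([row[i] for i in keep if i < len(row)])
    return selected


# Shortened versions of the two layouts shown in the question.
sample_a = io.StringIO("name, gender, preference\npeter, m, soap\npaul, m, gel\n")
sample_b = io.StringIO('"the preferences", "the name", "the gender"\nsoap, peter, m\n')

print(select_columns(sample_a))
print(select_columns(sample_b))
```

Because the header is read per file, the column position is resolved separately for each input, which is the part a fixed Pig LOAD schema cannot express on its own.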

Apache Pig, Suppress “Output Location Validation Failed” “Output directory … already exists”

有些话、适合烂在心里 submitted on 2020-01-14 19:14:11
Question: After getting help from orangeoctopus with this question, I now need to suppress the message "Output Location Validation Failed" "Output directory ... already exists". I know the directory exists; I want it that way. I am pretty sure this will be a matter of overriding something in my Storage UDF, but I am having trouble figuring out what. I'm totally new to Java, so bear with me. Thanks in advance.

Answer 1: As far as I know, you cannot reuse a direct output directory. Hadoop prevents it. If I

Check if an element is present in a bag?

烂漫一生 submitted on 2020-01-14 07:32:12
Question: How can I check in Pig Latin whether a bag contains an element? Example: in a bag of chararray, how can I check if a token is present?

Answer 1: In Apache Pig you can use statements nested in FOREACH; see Pig Basics. Here is the example from the documentation, where A is a bag in B:

    X = FOREACH B { S = FILTER A BY 'xyz'; GENERATE COUNT (S.$0); }

Instead of COUNT you can use IsEmpty and the ?: operator:

    X = FOREACH B { S = FILTER A BY 'xyz'; GENERATE (IsEmpty(S.$0) ? 'xyz NOT PRESENT' : 'xyz PRESENT') as present, B;

Generating binary variables in Pig\R

谁都会走 submitted on 2020-01-14 06:55:12
Question: I am working on a design for generating dummy (binary) variables in a Pig script or an R script.

Problem: the input to the Pig script is an arbitrary relation, say the table below:

    A   B   C
    a1  b1  c1
    a2  b2  c2
    a1  b1  c3

Suppose we have to generate binary columns based on B and C. The output should be:

    A   B   C   B.b1  B.b2  C.c1  C.c2  C.c3
    a1  b1  c1  1     0     1     0     0
    a2  b2  c2  0     1     0     1     0
    a1  b1  c3  1     0     0     0     1

I think writing a UDF would be the right approach. However, I am not sure how to define the output schema for the UDF as the column
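The encoding asked for here can be sketched in plain Python to show why the output schema is awkward: the set of new columns depends on the data, so one pass is needed to collect the distinct values before any row can be emitted. The helper `add_dummies` is hypothetical, and sorting the distinct values is an assumption made to get a stable column order.

```python
def add_dummies(rows, header, columns):
    """Append one 0/1 column per distinct value of each column in `columns`."""
    index = {name: i for i, name in enumerate(header)}
    # First pass: collect the distinct values (levels) of every encoded column.
    levels = {col: sorted({row[index[col]] for row in rows}) for col in columns}
    new_header = list(header) + [
        "%s.%s" % (col, value) for col in columns for value in levels[col]
    ]
    # Second pass: emit each row followed by its indicator flags.
    encoded = []
    for row in rows:
        flags = [
            1 if row[index[col]] == value else 0
            for col in columns
            for value in levels[col]
        ]
        encoded.append(list(row) + flags)
    return new_header, encoded


# Sample relation copied from the question.
header = ["A", "B", "C"]
rows = [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a1", "b1", "c3")]

new_header, encoded = add_dummies(rows, header, ["B", "C"])
print(new_header)
for row in encoded:
    print(row)
```

The same two-pass structure would apply to a UDF-based design: the distinct values have to be known before the schema with the extra columns can be declared.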

Load JSON array into Pig

白昼怎懂夜的黑 submitted on 2020-01-11 12:53:52
Question: I have a JSON file with the following format:

    [
      { "id": 2, "createdBy": 0, "status": 0, "utcTime": "Oct 14, 2014 4:49:47 PM", "placeName": "21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia", "longitude": 77.5983817, "latitude": 12.9832418, "createdDate": "Sep 16, 2014 2:59:03 PM", "accuracy": 5, "loginType": 1, "mobileNo": "0000005567" },
      { "id": 4, "createdBy": 0, "status": 0, "utcTime": "Oct 14, 2014 4:52:48 PM", "placeName": "21/F, Cunningham Main Rd, Sampangi Rama
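One pre-processing step that is often useful with input like this is to flatten the top-level array into one JSON object per line, on the assumption that the loader reads line-delimited records (as far as I know, Pig's built-in JsonLoader works this way, but check it against your Pig version). The sketch below uses a shortened sample with a few of the fields from the question; `array_to_lines` is a hypothetical helper.

```python
import json


def array_to_lines(text):
    """Turn a JSON array of objects into newline-delimited JSON, one object per line."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


# Shortened sample: only a few of the fields shown in the question.
sample = '[{"id": 2, "status": 0, "mobileNo": "0000005567"}, {"id": 4, "status": 0}]'

lines = array_to_lines(sample)
print(lines)
```

Each output line is a complete JSON object, so it can be parsed independently of the others.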

Pig default JsonLoader schema issue

风流意气都作罢 submitted on 2020-01-11 11:29:13
Question: I have the data below, which needs to be parsed using Pig.

Data:

    {
      "Name": "BBQ Chicken",
      "Sizes": [
        { "Size": "Large", "Price": 14.99 },
        { "Size": "Medium", "Price": 12.99 }
      ],
      "Toppings": [ "Barbecue Sauce", "Chicken", "Cheese" ]
    }

I am able to define the schema for Name and Sizes, but I couldn't get Toppings working. Looking for some help here.

Script:

    data = LOAD '/user/hue/data/nested_json_pizza_sample_data.json' USING JsonLoader('Name:chararray, Sizes:bag{tuple(Size:chararray, Price:float)},