apache-pig

Pig default JsonLoader schema issue

Submitted by 老子叫甜甜 on 2020-01-11 11:29:08
Question: I have the below data that needs to be parsed using Pig. Data { "Name": "BBQ Chicken", "Sizes": [ { "Size": "Large", "Price": 14.99 }, { "Size": "Medium", "Price": 12.99 } ], "Toppings": [ "Barbecue Sauce", "Chicken", "Cheese" ] } I am able to define the schema for Name and Sizes, but I couldn't get Toppings working. Looking for some help here. Script: data = LOAD '/user/hue/data/nested_json_pizza_sample_data.json' USING JsonLoader('Name:chararray, Sizes:bag{tuple(Size:chararray, Price:float)},
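A minimal sketch of the complete schema, assuming the same input path as above: Pig bags always contain tuples, so a JSON array of strings such as Toppings maps to a bag of single-field tuples (the field name Topping below is illustrative, not from the question).

-- Hedged sketch: declare Toppings as a bag of one-chararray tuples.
data = LOAD '/user/hue/data/nested_json_pizza_sample_data.json'
    USING JsonLoader('Name:chararray, Sizes:bag{tuple(Size:chararray, Price:float)}, Toppings:bag{tuple(Topping:chararray)}');
DUMP data;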

Invoke a Pig Latin script from another Pig script

Submitted by 五迷三道 on 2020-01-11 09:34:33
Question: I have a question about Pig Latin. Is there a way to invoke one Pig script from another Pig script? I know it is possible to run user defined functions (UDFs), like: REGISTER myudfs.jar; A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float); B = FOREACH A GENERATE myudfs.UPPER(name); DUMP B; But this does not work for a Pig script. We are computing a number of different customer parameters, and for readability and reuse it would be great to load some Pig snippets, something like:
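One hedged option, assuming Pig 0.9 or later, is to put the shared snippet in a macro file and pull it in with IMPORT; the file name common_macros.pig and the macro below are hypothetical, not from the question.

-- common_macros.pig (hypothetical shared file)
DEFINE to_upper_names(in_rel) RETURNS out_rel {
    $out_rel = FOREACH $in_rel GENERATE UPPER(name) AS name;
};

-- main script: import the macro file and reuse the snippet
IMPORT 'common_macros.pig';
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = to_upper_names(A);
DUMP B;

The Grunt shell also has run and exec commands that execute a whole Pig script, which may fit better when the snippet is a complete script rather than a reusable macro.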

Building Apache Pig for Hadoop 2.4

Submitted by 不想你离开。 on 2020-01-11 07:19:28
Question: I downloaded Pig 0.14 and ran ant -Dhadoopversion=23 jar, but when I used it on Hadoop 2.4 it is not working. Is there anything I should do other than just running ant? Pig is running, but it shows the error java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected. Thanks! Answer 1: If you checked out Pig from SVN trunk you can verify the Hadoop version it uses at $PIG_HOME/ivy/libraries.properties . For the "23" profile it is 2.4.0 . After you

Apache Pig permissions issue

Submitted by 穿精又带淫゛_ on 2020-01-10 03:00:08
Question: I'm attempting to get Apache Pig up and running on my Hadoop cluster, and am encountering a permissions problem. Pig itself launches and connects to the cluster just fine; from within the Pig shell, I can ls through and around my HDFS directories. However, when I try to actually load data and run Pig commands, I run into permissions-related errors: grunt> A = load 'all_annotated.txt' USING PigStorage() AS (id:long, text:chararray, lang:chararray); grunt> DUMP A; 2011-08-24 18:11:40,961
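A hedged diagnostic sketch, run from the same Grunt shell: since the relative path resolves under the current user's HDFS home directory, comparing the file's owner and mode with the account running Pig often narrows this down. The path and user name below are placeholders, not values from the question.

-- Who owns the file, and with what permissions? (path is a placeholder)
grunt> fs -ls /user/<your_user>/all_annotated.txt
-- Does the HDFS home directory exist and belong to the user running Pig?
grunt> fs -ls /user
-- If another account owns the data, that owner (or the HDFS superuser) can relax it:
grunt> fs -chmod -R 755 /user/<your_user>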

Submit a Pig job from Oozie

Submitted by 戏子无情 on 2020-01-07 08:53:26
Question: I am working on automating Pig jobs using Oozie in a Hadoop cluster. I was able to run a sample Pig script from Oozie, but my next requirement is to run a Pig job where the Pig script receives its input parameters from a shell script. Please share your thoughts. Answer 1: UPDATE: OK, to make the original question clear: how can you pass a parameter from a shell script's output? Here's the working example: WORKFLOW.XML <workflow-app xmlns='uri:oozie:workflow:0.3' name='shell-wf'> <start to='shell1' />
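On the Pig side of such a workflow, the value captured from the shell action is typically handed to the Pig action as a named parameter and substituted into the script; a minimal sketch, assuming a parameter called INPUT_DATE (a hypothetical name) is forwarded via a <param> element in the Pig action:

-- process.pig (hypothetical script name); $INPUT_DATE is replaced by Pig's
-- parameter substitution with the value passed in from the workflow.
data = LOAD '/data/events/$INPUT_DATE' USING PigStorage(',')
       AS (id:int, value:chararray);
DUMP data;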

In Pig, how to concatenate all items in a bag?

Submitted by 安稳与你 on 2020-01-07 04:40:14
Question: I have a structure like {A, {1,2,3}} {B, {4,5,6}} What I want is {A, "1|2|3"} {B, "4|5|6"} I looked at the CONCAT operator, but that will not help me achieve what I want. Answer 1: This is most easily achieved with a Python UDF. myudfs.py #!/usr/bin/python @outputSchema('concated: string') def concat_bag(BAG): return '|'.join([ str(i) for i in BAG ]) It can be used like: Register 'myudfs.py' using jython as myfuncs; -- Schema of A is: A:{ T:(letter: chararray, B_of_nums: {num: int}) } B = FOREACH A
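If a Python UDF is not an option, newer Pig releases ship a BagToString builtin that joins a bag's contents with a delimiter; a hedged sketch, using positional references because the exact schema of A is only partially shown above:

-- $0 is the letter, $1 is the bag of numbers; BagToString joins the bag
-- items with the given delimiter ('|').
B = FOREACH A GENERATE $0 AS letter, BagToString($1, '|') AS joined;
DUMP B;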

Group by X OR Y in Pig

Submitted by 假如想象 on 2020-01-06 23:43:47
Question: I am processing a large amount of data with Pig and I need to group records by one field OR another. Be careful that this is not the classic GROUP BY X AND Y; I mean, you have to group two records if they have the same value for attribute X OR attribute Y. For example, given this dataset: 1, a, 'r1' 2, b, 'r2' 3, c, 'r3' 4, a, 'r4' 3, d, 'r5' 5, c, 'r6' 5, e, 'r7' The result of grouping by the first OR second field should be: {(1, a, 'r1'), (4, a, 'r4')} {(2, b, 'r2')} {(3, c, 'r3'), (3, d, 'r5'), (5, c, 'r6
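As a hedged starting point only: grouping by each field separately is straightforward in Pig, but merging groups that share a record (so that the 3/c, 3/d and 5/c rows all collapse together) is effectively a connected-components problem and needs an iterative step, a UDF, or an outer driver loop on top of the sketch below. Field names are assumptions.

-- First pass: one grouping per key; these still have to be merged transitively.
data  = LOAD 'records' USING PigStorage(',') AS (x:int, y:chararray, r:chararray);
by_x  = GROUP data BY x;
by_y  = GROUP data BY y;
gx    = FOREACH by_x GENERATE data AS members;
gy    = FOREACH by_y GENERATE data AS members;
seeds = UNION gx, gy;
DUMP seeds;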

PL/SQL to Pig Conversion

Submitted by 喜夏-厌秋 on 2020-01-06 19:41:29
Question: Select x,y,.., CASE( when A.A_code = 'G' THEN COUNT(DISTINCT CASE WHEN T.trxn = 'P' THEN D.A || D.B || D.C ELSE NULL END) else 0 end p_count,..From.. is my PL/SQL query structure. I need to convert it to Pig. I have converted the inner CASE query successfully and it executes; the inner query of the PL/SQL CASE condition becomes the following in Pig: T = LOAD '//transaction_types' USING PigStorage(',') as (id:int,trxn:chararray); D = LOAD '/home/sterlingpc1/Desktop/det_trades' USING PigStorage(',') as (id
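Continuing from those two LOADs, one possible shape for the conditional COUNT(DISTINCT ...) in Pig is sketched below; the join key and the column names A, B, C on D are assumptions, since D's schema is cut off above.

-- COUNT(DISTINCT CASE WHEN T.trxn = 'P' THEN D.A || D.B || D.C END), roughly:
J       = JOIN T BY id, D BY id;                      -- assumed join key
P_ONLY  = FILTER J BY T::trxn == 'P';                 -- the WHEN T.trxn = 'P' branch
KEYS    = FOREACH P_ONLY GENERATE CONCAT(CONCAT(D::A, D::B), D::C) AS abc;
UNIQ    = DISTINCT KEYS;
GRPD    = GROUP UNIQ ALL;
P_COUNT = FOREACH GRPD GENERATE COUNT(UNIQ) AS p_count;
DUMP P_COUNT;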

Pig 0.7.0 ERROR 2118: Unable to create input splits on Hadoop 1.2.1

Submitted by 蓝咒 on 2020-01-06 13:51:55
Question: I got an output file (stored on HDFS) from a MapReduce program. Now I am trying to load that file using Pig 0.7.0, and I am getting the following error. I have tried copying the file to my local machine and running Pig in local mode, which works fine, but I want to skip that step and make it work in MapReduce mode. Options I tried: LOAD 'file://log/part-00000', LOAD '/log/part-00000', LOAD 'hdfs:/log/part-00000', LOAD 'hdfs://localhost:50070/log/part-00000', hadoop dfs -ls /log/ Warning: $HADOOP_HOME is
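One hedged thing to check: hdfs://localhost:50070 is the NameNode web UI port, not the filesystem RPC port, so a fully qualified URI should use the port from fs.default.name in core-site.xml (commonly 8020 or 9000). The port below is therefore an assumption, not a value from the question.

-- Fully qualified path using the NameNode RPC port (check core-site.xml):
A = LOAD 'hdfs://localhost:9000/log/part-00000' USING PigStorage();
-- Or leave the path unqualified and let Pig pick up fs.default.name from the
-- Hadoop configuration on the classpath (HADOOP_CONF_DIR / PIG_CLASSPATH):
B = LOAD '/log/part-00000' USING PigStorage();
DUMP A;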
