apache-pig

Pig default JsonLoader schema issue

Submitted by 老子叫甜甜 on 2020-01-11 11:29:08
Question: I have the below data that needs to be parsed using Pig. Data { "Name": "BBQ Chicken", "Sizes": [ { "Size": "Large", "Price": 14.99 }, { "Size": "Medium", "Price": 12.99 } ], "Toppings": [ "Barbecue Sauce", "Chicken", "Cheese" ] } I am able to define the schema for Name and Sizes, but I couldn't get Toppings working. Looking for some help here. Script: data = LOAD '/user/hue/data/nested_json_pizza_sample_data.json' USING JsonLoader('Name:chararray, Sizes:bag{tuple(Size:chararray, Price:float)},
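A minimal sketch of the complete schema, assuming the same input path as above: Pig bags always contain tuples, so a JSON array of strings such as Toppings maps to a bag of single-field tuples (the field name Topping below is illustrative, not from the question).

-- Hedged sketch: declare Toppings as a bag of one-chararray tuples.
data = LOAD '/user/hue/data/nested_json_pizza_sample_data.json'
    USING JsonLoader('Name:chararray, Sizes:bag{tuple(Size:chararray, Price:float)}, Toppings:bag{tuple(Topping:chararray)}');
DUMP data;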

Invoke a Pig Latin script from another Pig script

Submitted by 五迷三道 on 2020-01-11 09:34:33
Question: I have a question about Pig Latin. Is there a way to invoke one Pig script from another Pig script? I know it is possible to run user defined functions (UDFs), like: REGISTER myudfs.jar; A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float); B = FOREACH A GENERATE myudfs.UPPER(name); DUMP B; But this does not work for a Pig script. We are computing a number of different customer parameters, and for readability and reuse it would be great to load some Pig snippets, something like:
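One hedged option, assuming Pig 0.9 or later, is to put the shared snippet in a macro file and pull it in with IMPORT; the file name common_macros.pig and the macro below are hypothetical, not from the question.

-- common_macros.pig (hypothetical shared file)
DEFINE to_upper_names(in_rel) RETURNS out_rel {
    $out_rel = FOREACH $in_rel GENERATE UPPER(name) AS name;
};

-- main script: import the macro file and reuse the snippet
IMPORT 'common_macros.pig';
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = to_upper_names(A);
DUMP B;

The Grunt shell also has run and exec commands that execute a whole Pig script, which may fit better when the snippet is a complete script rather than a reusable macro.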

Building Apache Pig for Hadoop 2.4

Submitted by 不想你离开。 on 2020-01-11 07:19:28
Question: I downloaded Pig 0.14 and ran ant -Dhadoopversion=23 jar, but when I used it on Hadoop 2.4 it is not working. Is there anything I should do other than just running ant? Pig is running, but it shows the error java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected. Thanks! Answer 1: If you checked out Pig from SVN trunk you can verify the Hadoop version it uses at $PIG_HOME/ivy/libraries.properties . For the "23" profile it is 2.4.0 . After you

Apache Pig permissions issue

Submitted by 穿精又带淫゛_ on 2020-01-10 03:00:08
Question: I'm attempting to get Apache Pig up and running on my Hadoop cluster, and am encountering a permissions problem. Pig itself launches and connects to the cluster just fine; from within the Pig shell, I can ls through and around my HDFS directories. However, when I try to actually load data and run Pig commands, I run into permissions-related errors: grunt> A = load 'all_annotated.txt' USING PigStorage() AS (id:long, text:chararray, lang:chararray); grunt> DUMP A; 2011-08-24 18:11:40,961
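A hedged diagnostic sketch, run from the same Grunt shell: since the relative path resolves under the current user's HDFS home directory, comparing the file's owner and mode with the account running Pig often narrows this down. The path and user name below are placeholders, not values from the question.

-- Who owns the file, and with what permissions? (path is a placeholder)
grunt> fs -ls /user/<your_user>/all_annotated.txt
-- Does the HDFS home directory exist and belong to the user running Pig?
grunt> fs -ls /user
-- If another account owns the data, that owner (or the HDFS superuser) can relax it:
grunt> fs -chmod -R 755 /user/<your_user>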

Submit a Pig job from Oozie

Submitted by 戏子无情 on 2020-01-07 08:53:26
Question: I am working on automating Pig jobs using Oozie in a Hadoop cluster. I was able to run a sample Pig script from Oozie, but my next requirement is to run a Pig job where the Pig script receives its input parameters from a shell script. Please share your thoughts. Answer 1: UPDATE: OK, to make the original question clear: how can you pass a parameter from a shell script's output? Here's the working example: WORKFLOW.XML <workflow-app xmlns='uri:oozie:workflow:0.3' name='shell-wf'> <start to='shell1' />
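On the Pig side of such a workflow, the value captured from the shell action is typically handed to the Pig action as a named parameter and substituted into the script; a minimal sketch, assuming a parameter called INPUT_DATE (a hypothetical name) is forwarded via a <param> element in the Pig action:

-- process.pig (hypothetical script name); $INPUT_DATE is replaced by Pig's
-- parameter substitution with the value passed in from the workflow.
data = LOAD '/data/events/$INPUT_DATE' USING PigStorage(',')
       AS (id:int, value:chararray);
DUMP data;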

In Pig, how to concatenate all items in a bag?

Submitted by 安稳与你 on 2020-01-07 04:40:14
Question: I have a structure like {A, {1,2,3}} {B, {4,5,6}} What I want is {A, "1|2|3"} {B, "4|5|6"} I looked at the CONCAT operator, but that will not help me achieve what I want. Answer 1: This is most easily achieved with a Python UDF. myudfs.py #!/usr/bin/python @outputSchema('concated: string') def concat_bag(BAG): return '|'.join([ str(i) for i in BAG ]) It can be used like: Register 'myudfs.py' using jython as myfuncs; -- Schema of A is: A:{ T:(letter: chararray, B_of_nums: {num: int}) } B = FOREACH A
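If a Python UDF is not an option, newer Pig releases ship a BagToString builtin that joins a bag's contents with a delimiter; a hedged sketch, using positional references because the exact schema of A is only partially shown above:

-- $0 is the letter, $1 is the bag of numbers; BagToString joins the bag
-- items with the given delimiter ('|').
B = FOREACH A GENERATE $0 AS letter, BagToString($1, '|') AS joined;
DUMP B;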

Group by X OR Y in Pig

Submitted by 假如想象 on 2020-01-06 23:43:47
Question: I am processing a large amount of data with Pig and I need to group records by one field OR another. Be careful that this is not the classic GROUP BY X AND Y; I mean, you have to group two records if they have the same value for attribute X OR attribute Y. For example, given this dataset: 1, a, 'r1' 2, b, 'r2' 3, c, 'r3' 4, a, 'r4' 3, d, 'r5' 5, c, 'r6' 5, e, 'r7' The result of grouping by the first OR second field should be: {(1, a, 'r1'), (4, a, 'r4')} {(2, b, 'r2')} {(3, c, 'r3'), (3, d, 'r5'), (5, c, 'r6
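As a hedged starting point only: grouping by each field separately is straightforward in Pig, but merging groups that share a record (so that the 3/c, 3/d and 5/c rows all collapse together) is effectively a connected-components problem and needs an iterative step, a UDF, or an outer driver loop on top of the sketch below. Field names are assumptions.

-- First pass: one grouping per key; these still have to be merged transitively.
data  = LOAD 'records' USING PigStorage(',') AS (x:int, y:chararray, r:chararray);
by_x  = GROUP data BY x;
by_y  = GROUP data BY y;
gx    = FOREACH by_x GENERATE data AS members;
gy    = FOREACH by_y GENERATE data AS members;
seeds = UNION gx, gy;
DUMP seeds;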

PL/SQL to Pig Conversion

Submitted by 喜夏-厌秋 on 2020-01-06 19:41:29
Question: Select x,y,.., CASE( when A.A_code = 'G' THEN COUNT(DISTINCT CASE WHEN T.trxn = 'P' THEN D.A || D.B || D.C ELSE NULL END) else 0 end p_count,..From.. is my PL/SQL query structure. I need to convert it to Pig. I have converted the inner CASE query successfully and it executes; the inner query of the PL/SQL CASE condition becomes the following in Pig: T = LOAD '//transaction_types' USING PigStorage(',') as (id:int,trxn:chararray); D = LOAD '/home/sterlingpc1/Desktop/det_trades' USING PigStorage(',') as (id
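Continuing from those two LOADs, one possible shape for the conditional COUNT(DISTINCT ...) in Pig is sketched below; the join key and the column names A, B, C on D are assumptions, since D's schema is cut off above.

-- COUNT(DISTINCT CASE WHEN T.trxn = 'P' THEN D.A || D.B || D.C END), roughly:
J       = JOIN T BY id, D BY id;                      -- assumed join key
P_ONLY  = FILTER J BY T::trxn == 'P';                 -- the WHEN T.trxn = 'P' branch
KEYS    = FOREACH P_ONLY GENERATE CONCAT(CONCAT(D::A, D::B), D::C) AS abc;
UNIQ    = DISTINCT KEYS;
GRPD    = GROUP UNIQ ALL;
P_COUNT = FOREACH GRPD GENERATE COUNT(UNIQ) AS p_count;
DUMP P_COUNT;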

Pig 0.7.0 ERROR 2118: Unable to create input splits on Hadoop 1.2.1

Submitted by 蓝咒 on 2020-01-06 13:51:55
Question: I got an output file (stored on HDFS) from a MapReduce program. Now I am trying to load that file using Pig 0.7.0, and I am getting the following error. I have tried copying the file to my local machine and running Pig in local mode, which works fine, but I want to skip that step and make it work in MapReduce mode. Options I tried: LOAD 'file://log/part-00000', LOAD '/log/part-00000', LOAD 'hdfs:/log/part-00000', LOAD 'hdfs://localhost:50070/log/part-00000', hadoop dfs -ls /log/ Warning: $HADOOP_HOME is
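One hedged thing to check: hdfs://localhost:50070 is the NameNode web UI port, not the filesystem RPC port, so a fully qualified URI should use the port from fs.default.name in core-site.xml (commonly 8020 or 9000). The port below is therefore an assumption, not a value from the question.

-- Fully qualified path using the NameNode RPC port (check core-site.xml):
A = LOAD 'hdfs://localhost:9000/log/part-00000' USING PigStorage();
-- Or leave the path unqualified and let Pig pick up fs.default.name from the
-- Hadoop configuration on the classpath (HADOOP_CONF_DIR / PIG_CLASSPATH):
B = LOAD '/log/part-00000' USING PigStorage();
DUMP A;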
