apache-pig

How to specify “pig-0.13.0-h2.jar” dependency in build.gradle?

我的梦境 submitted on 2019-12-11 11:04:06
Question: To specify a Maven dependency in my project, I provide a name, a group ID, and a version. This has been enough for every dependency in my project, save one. Pig has multiple jars under the same artifact (not sure if I have the proper nomenclature; I'm still rather new to Maven), but I only need one. Specifically, I need pig-0.13.0-h2.jar. However, when I provide the dependency compile "org.apache.pig:pig:0.13.0" in my build.gradle, only pig-0.13.0.jar, pig-0.13.0-sources.jar, and pig-0.13…
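The h2 jar is distinguished from the default artifact by a Maven classifier, and Gradle's shorthand dependency notation accepts a classifier as a fourth segment after the version. A minimal sketch, assuming the artifact is published with the classifier `h2`:

```groovy
dependencies {
    // group:name:version:classifier — resolves pig-0.13.0-h2.jar instead of pig-0.13.0.jar
    compile "org.apache.pig:pig:0.13.0:h2"
}
```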

How To Find All Possible Permutations From A Bag in Apache Pig

三世轮回 submitted on 2019-12-11 09:59:52
Question: I'm trying to find all possible combinations using Apache Pig. I was able to generate the permutations, but I want to eliminate the duplicated values. I wrote this code: A = LOAD 'data' AS f1:chararray; DUMP A; ('A') ('B') ('C') B = FOREACH A GENERATE $0 AS v1; C = FOREACH A GENERATE $0 AS v2; D = CROSS B, C; And the result I obtained is: DUMP D; ('A', 'A') ('A', 'B') ('A', 'C') ('B', 'A') ('B', 'B') ('B', 'C') ('C', 'A') ('C', 'B') ('C', 'C') But the result I'm trying to obtain is
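Assuming "eliminating replication" means dropping self-pairs and keeping each unordered pair only once, a sketch that simply filters the CROSS output (Pig compares chararrays lexicographically):

```pig
A = LOAD 'data' AS (f1:chararray);
B = FOREACH A GENERATE f1 AS v1;
C = FOREACH A GENERATE f1 AS v2;
D = CROSS B, C;
-- keep only pairs where v1 sorts before v2: drops ('A','A') and keeps one of ('A','B')/('B','A')
E = FILTER D BY v1 < v2;
DUMP E;
-- expected: ('A','B') ('A','C') ('B','C')
```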

MultiStorage in Pig

孤者浪人 submitted on 2019-12-11 09:59:48
Question: I have run the below Pig script in the Grunt shell: Register D:\Pig\contrib\piggybank\java\piggybank.jar; a = load '/part' using PigStorage(',') as (uuid:chararray,timestamp:chararray,Name:chararray,EmailID:chararray,CompanyName:chararray,Location:chararray); store a into '/output/multistorage' USING MultiStorage('/output/multistorage','2', 'none', ','); While running this it throws the error shown below: 2015-11-03 05:47:36,328 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could
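ERROR 1070 usually means Pig could not resolve a function name. MultiStorage lives in the piggybank storage package, so it has to be referenced by its fully qualified name (or aliased with DEFINE). A sketch, reusing the same paths as above:

```pig
REGISTER D:\Pig\contrib\piggybank\java\piggybank.jar;
a = LOAD '/part' USING PigStorage(',')
    AS (uuid:chararray, timestamp:chararray, Name:chararray,
        EmailID:chararray, CompanyName:chararray, Location:chararray);
-- the fully qualified class name resolves the ERROR 1070 lookup failure
STORE a INTO '/output/multistorage'
    USING org.apache.pig.piggybank.storage.MultiStorage('/output/multistorage', '2', 'none', ',');
```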

Are load operations in a Pig script sequential or parallel?

ぐ巨炮叔叔 submitted on 2019-12-11 09:43:55
Question: I have 2 load statements in a Pig script, as below: a = load 'file1.dat' using HCatLoader(); b = load 'file2.dat' using HCatLoader(); After these, I have some transformations on a and b separately. If we run this Pig script in batch mode, do the loads and transformations of both files happen sequentially or in parallel? I was thinking that Pig optimises this script and runs both loads in parallel, but I'm not 100% sure. Can anyone comment on this? Answer 1: Each load command will run in parallel, but

Poor performance on hash joins with Pig on Tez

倖福魔咒の submitted on 2019-12-11 08:54:12
Question: I have a series of Pig scripts that are transforming hundreds of millions of records from multiple data sources that need to be joined together. Towards the end of each script, I reach a point where JOIN performance becomes terribly slow. Looking at the DAG in the Tez View, I see that it is split into relatively few tasks (typically 100-200), but each task takes multiple hours to complete. The task description shows that it's doing a HASH_JOIN. Interestingly, I only run into this bottleneck
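Two common levers in Pig when a shuffle-based join stalls on a handful of long-running tasks are forcing a map-side join for a small input, or raising the join's parallelism. A sketch (relation and key names are hypothetical):

```pig
-- map-side (replicated) join: loads the smaller relation into memory and skips the shuffle entirely
joined = JOIN big_rel BY join_key, small_rel BY join_key USING 'replicated';

-- alternatively, spread the hash join across more tasks
joined2 = JOIN rel_a BY join_key, rel_b BY join_key PARALLEL 1000;
```

Whether either helps depends on the data: 'replicated' only works if one side fits in task memory, and if a few keys dominate, a skewed join (`USING 'skewed'`) may be the better fit.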

Apache Pig: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

蓝咒 submitted on 2019-12-11 08:17:31
Question: I'm running Pig 0.15 and am trying to group data here. I'm running into a Requested array size exceeds VM limit error. The file size is pretty small and takes just 10 mappers of 2.5G each to run with no memory errors. Below is a snippet of what I'm doing: sample_set = LOAD 's3n://<bucket>/<dev_dir>/000*-part.gz' USING PigStorage(',') AS (col1:chararray,col2:chararray..col23:chararray); sample_set_group_by_col1 = GROUP sample_set BY col1; sample_set_group_by_col1_10 = LIMIT sample_set
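This error typically surfaces when a single group's bag grows enormous. One common workaround, assuming only a few tuples per group are actually needed, is to apply the LIMIT inside a nested FOREACH so the full bag is never materialized downstream; a sketch with the aliases from the question:

```pig
sample_set_group_by_col1 = GROUP sample_set BY col1;
-- limit per group, rather than limiting the grouped relation as a whole
top_rows = FOREACH sample_set_group_by_col1 {
    lim = LIMIT sample_set 10;
    GENERATE group, lim;
};
```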

Is compression/decompression of gzip data transparent in Hadoop/PIG?

天涯浪子 submitted on 2019-12-11 08:14:38
Question: I read somewhere that Hadoop has built-in support for compression and decompression, but I guess that is about mapper output (set via some properties)? I wonder if there are any particular Pig load/store functions I can use for reading compressed data or writing data out compressed? Answer 1: PigStorage handles compressed input by examining the file names: *.bz2 / *.bz use org.apache.pig.bzip2r.Bzip2TextInputFormat; everything else uses org.apache.pig.backend.hadoop.executionengine
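In practice, gzip handling is transparent on both sides: PigStorage decompresses *.gz inputs based on the file extension, and output compression can be switched on with job properties. A sketch (paths are hypothetical):

```pig
-- *.gz input is decompressed transparently by PigStorage
logs = LOAD 'input/data.gz' USING PigStorage('\t') AS (line:chararray);

-- enable compressed output via job properties
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
STORE logs INTO 'output/compressed' USING PigStorage('\t');
```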

How to flatten recursive hierarchy using Hive/Pig/MapReduce

a 夏天 submitted on 2019-12-11 08:01:11
Question: I have unbalanced tree data stored in tabular format like: parent,child a,b b,c c,d c,f f,g The depth of the tree is unknown. How can I flatten this hierarchy so that each row contains the entire path from leaf node to root node, as in: leaf node, root node, intermediate nodes d,a,d:c:b f,a,e:b Any suggestions on solving the above problem using Hive, Pig, or MapReduce? Thanks in advance. Answer 1: I tried to solve it using Pig; here is the sample code: Join function: -- Join parent and child Define join
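The core of any Pig solution is an iterative self-join: start with single-edge paths and repeatedly extend each path by joining its current top against the edge list, until no path grows. A sketch of the seed and one extension step (alias names are hypothetical; a driver script would repeat the step until the output stops changing):

```pig
edges = LOAD 'tree.csv' USING PigStorage(',') AS (parent:chararray, child:chararray);
-- seed: each edge is a length-1 path from a node up to its parent
paths = FOREACH edges GENERATE child AS node, parent AS top, child AS trail;
-- one extension step: climb one level by joining the path's top against the child column
j = JOIN paths BY top, edges BY child;
longer = FOREACH j GENERATE paths::node AS node, edges::parent AS top,
                            CONCAT(paths::trail, CONCAT(':', paths::top)) AS trail;
-- after two extensions, the path seeded from edge (c,d) becomes (node d, top a, trail d:c:b)
```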

What is a job history server in Hadoop and why is it mandatory to start the history server before starting Pig in Map Reduce mode?

一曲冷凌霜 submitted on 2019-12-11 07:50:51
Question: Before starting Pig in MapReduce mode you always have to start the history server, otherwise while trying to execute Pig Latin statements the below-mentioned logs are generated: 2018-10-18 15:59:13,709 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. **Redirecting to job history server** 2018-10-18 15:59:14,713 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 0
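The MapReduce JobHistory Server is a standalone daemon (listening on port 10020 by default, which matches the connection retries in the log) that the Pig client queries for the final status and counters of completed jobs. It is not launched by the HDFS or YARN startup scripts, so it has to be started separately; a sketch of the usual Hadoop 2.x setup:

```shell
# start the JobHistory Server daemon
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
```

with the client-facing address configured in mapred-site.xml (hostname here is illustrative):

```xml
<!-- mapred-site.xml: where clients such as Pig reach the history server -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>localhost:10020</value>
</property>
```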

DataStax Cassandra Pig running only one map

我只是一个虾纸丫 submitted on 2019-12-11 07:25:17
Question: I am using DataStax Cassandra 3.1.4 with two nodes. I am running Pig with CqlStorage() on a table with 12 million rows, but I find there is only one map task running for a simple Pig command. I tried changing split_size in my Pig relation but it didn't work. Here is my sample query: x = load 'cql://Mykeyspace/MyCF?split_size=1000' using CqlStorage(); y = limit x 500; dump y I didn't find the input.split.size property in my mapred-site.xml. I am assuming the default split size is 64*1024. I tried set pig
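Besides the URL parameter, the split size used by Cassandra's Hadoop input format can be set as a job property before the LOAD; a sketch (the value is hypothetical, and the unit is rows per split):

```pig
-- lower the rows-per-split so 12 million rows produce many input splits (and thus many map tasks)
SET cassandra.input.split.size 65536;
x = LOAD 'cql://Mykeyspace/MyCF' USING CqlStorage();
```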