apache-pig

I have 50 fields. Is there any option in Apache Pig to print the first 40 fields, something like the range $0-$39?

99封情书 submitted on 2019-12-11 07:08:56
Question: I have 50 fields. Is there any option in Pig to print the first 40 fields? I require something like the range $0-$39. I don't want to specify each and every field like $0, $1, $2, etc. Spelling out every column is acceptable when the number of columns is small, but what do you do when there is a huge number of columns?

Answer 1: You can use the .. notation.

First 40 fields: B = FOREACH A GENERATE $0..$39;
All fields: B = FOREACH A GENERATE $0..;
Multiple ranges, for example first 10, 15-20, 25-50: B = FOREACH A
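The truncated multiple-range example above can be sketched as follows. The relation names and input file are assumptions; note that Pig positional references are 0-based, so "fields 15-20" becomes $14..$19:

```pig
-- Load a 50-field relation (schema omitted; positional references are 0-based)
A = LOAD 'data.tsv' USING PigStorage('\t');

-- First 40 fields
B = FOREACH A GENERATE $0..$39;

-- All fields from the first onward
C = FOREACH A GENERATE $0..;

-- Multiple ranges: first 10, fields 15-20, fields 25-50 (1-based counting)
D = FOREACH A GENERATE $0..$9, $14..$19, $24..$49;
```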

Can't connect to Bigtable to scan HTable data due to hardcoded managed=true in hbase client jars

时光毁灭记忆、已成空白 submitted on 2019-12-11 06:37:43
Question: I'm working on a custom load function to load data from Bigtable using Pig on Dataproc. I compile my Java code using the following list of jar files I grabbed from Dataproc. When I run the following Pig script, it fails when it tries to establish a connection with Bigtable. The error message is: Bigtable does not support managed connections. Questions: Is there a workaround for this problem? Is this a known issue, and is there a plan to fix or adjust it? Is there a different way of implementing
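One possible workaround, sketched below, is to obtain the connection through the Bigtable HBase adapter rather than HBase's own managed ConnectionFactory path. The class com.google.cloud.bigtable.hbase.BigtableConfiguration and its connect(...) signature are assumptions about the bigtable-hbase client version on the cluster, so verify them before relying on this:

```java
import org.apache.hadoop.hbase.client.Connection;
import com.google.cloud.bigtable.hbase.BigtableConfiguration;

public class BigtableConnect {
    public static Connection open() {
        // BigtableConfiguration.connect() returns an unmanaged connection,
        // sidestepping the managed=true behavior hardcoded in the HBase
        // client jars (the project and instance ids are placeholders)
        return BigtableConfiguration.connect("my-project", "my-instance");
    }
}
```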

Java UDF Date Regex Extractor for Pig?

岁酱吖の submitted on 2019-12-11 06:14:56
Question: I am trying to create a UDF for importing into Pig that matches a regex pattern on a date. The regex has been tested and works accordingly, but I am having trouble with the following code: package com.date.format; import java.io.IOException; import java.util.regex.Matcher; import java.util.regex.Pattern; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; public class DATERANGE extends EvalFunc<String> { @Override public String exec(Tuple arg0) throws IOException { try { String
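A complete version of such an EvalFunc might look like the sketch below. The date pattern and the null-handling are assumptions, since the original code is truncated before the regex appears:

```java
package com.date.format;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class DATERANGE extends EvalFunc<String> {
    // Example pattern (an assumption): dates like 21-Oct-2013
    private static final Pattern DATE =
            Pattern.compile("\\d{2}-[A-Za-z]{3}-\\d{4}");

    @Override
    public String exec(Tuple arg0) throws IOException {
        // Guard against empty tuples and null fields
        if (arg0 == null || arg0.size() == 0 || arg0.get(0) == null) {
            return null;
        }
        try {
            String line = (String) arg0.get(0);
            Matcher m = DATE.matcher(line);
            return m.find() ? m.group() : null;
        } catch (Exception e) {
            throw new IOException("Error processing input row", e);
        }
    }
}
```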

Selecting random tuple from bag

依然范特西╮ submitted on 2019-12-11 05:55:35
Question: Is it possible to (efficiently) select a random tuple from a bag in Pig? I can just take the first result of a bag (as it is unordered), but in my case I need a proper random selection. One (inefficient) solution is to count the number of tuples in the bag, pick a random number within that range, loop through the bag, and stop whenever the number of iterations matches my random number. Does anyone know of a faster/better way to do this? Answer 1: You could use RANDOM(), ORDER and LIMIT in a
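The RANDOM()/ORDER/LIMIT idea mentioned in the truncated answer can be sketched as a nested FOREACH; the relation names and schema are assumptions:

```pig
A = LOAD 'data.tsv' AS (key:chararray, val:int);
G = GROUP A BY key;

-- Attach a random sort key to every tuple in the bag, sort by it,
-- and keep only the first tuple: a uniform random pick per group
R = FOREACH G {
    withRand = FOREACH A GENERATE RANDOM() AS r, *;
    sorted   = ORDER withRand BY r;
    picked   = LIMIT sorted 1;
    GENERATE group, FLATTEN(picked);
};
```

The sort costs O(n log n) per bag, but it avoids a second pass over the data and stays inside a single FOREACH.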

Error using CSVLoader from piggybank

无人久伴 submitted on 2019-12-11 05:04:46
Question: I am trying to use CSVLoader from Piggybank. Below are the first two lines of my code:

register 'piggybank.jar';
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();

It throws the following error:

2013-10-24 14:26:51,427 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2013-10-24 14:26:52,029 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve org.apache.pig.piggybank.storage
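ERROR 1070 here usually means the jar was never actually loaded, for example because 'piggybank.jar' is not in the directory Pig was started from. A minimal sketch using an absolute path (the jar location and input file are assumptions):

```pig
-- Register piggybank with an absolute path so Pig can resolve the class
REGISTER /usr/lib/pig/piggybank.jar;

DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();

-- input.csv is a hypothetical comma-separated file
A = LOAD 'input.csv' USING CSVLoader() AS (id:int, name:chararray);
DUMP A;
```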

Join table by string matching in Hive or Impala or Pig

我的未来我决定 submitted on 2019-12-11 04:36:42
Question: I have two tables A and B, where B is huge (20 million rows by 300 columns) and A is of moderate size (300k rows by 10 columns). A contains one column that is an address, and B contains 3 columns that can be put together to form a proper street address. For example, in A the address column could be:

id  | Address
-------------------
233 | 123 Main St

and in B we could have:

Number | Street_name | Street_suffix | Tax
------------------------------------------------
123    | Main        | Street        | 320.2

I want to join them using string
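One hedged sketch in Pig is to build a normalized join key on both sides and join on it. CONCAT, LOWER, and REPLACE are Pig built-ins; the specific suffix normalization shown (mapping 'Street' to 'St') is an assumption about the data:

```pig
A = LOAD 'a.tsv' AS (id:int, address:chararray);
B = LOAD 'b.tsv' AS (number:chararray, street_name:chararray,
                     street_suffix:chararray, tax:double);

-- Build a comparable key on each side; REPLACE normalizes 'Street' -> 'St'
A2 = FOREACH A GENERATE id, LOWER(address) AS addr_key;
B2 = FOREACH B GENERATE tax,
     LOWER(CONCAT(number, CONCAT(' ', CONCAT(street_name, CONCAT(' ',
           REPLACE(street_suffix, 'Street', 'St')))))) AS addr_key;

-- A is the small side, so a replicated (map-side) join avoids shuffling B
J = JOIN B2 BY addr_key, A2 BY addr_key USING 'replicated';
```

Exact-match joins on concatenated keys only work if the normalization really makes the strings identical; fuzzy matching would need a UDF.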

how to cluster users based on tags

穿精又带淫゛_ submitted on 2019-12-11 04:11:51
Question: I'd like to cluster users based on the categories or tags of the shows they watch. What's the easiest/best algorithm to do this? Assuming I have around 20,000 tags and several million watch events I can use as signals, is there an algorithm I can implement using, say, Pig/Hadoop/Mortar, or perhaps on Neo4j? In terms of data, I have users, the programs they've watched, and the tags that a program has (usually around 10 tags per program). At the end I would expect k clusters (maybe a
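Whatever clustering algorithm is chosen, the watch events first have to be turned into per-user tag counts. A minimal Pig sketch of that feature-building step (file names and schemas are assumptions):

```pig
watches = LOAD 'watch_events.tsv' AS (user_id:chararray, program_id:chararray);
tags    = LOAD 'program_tags.tsv' AS (program_id:chararray, tag:chararray);

-- Expand each watch event into one row per tag of the watched program
J = JOIN watches BY program_id, tags BY program_id;
user_tags = FOREACH J GENERATE watches::user_id AS user_id, tags::tag AS tag;

-- Count how often each user saw each tag; these sparse counts are the
-- feature vectors a clustering algorithm such as k-means would consume
G = GROUP user_tags BY (user_id, tag);
features = FOREACH G GENERATE FLATTEN(group) AS (user_id, tag),
                              COUNT(user_tags) AS cnt;
```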

Pig referencing

隐身守侯 submitted on 2019-12-11 03:50:48
Question: I am learning Hadoop Pig and I always get stuck at referencing the elements. Please see the example below: groupwordcount: {group: chararray, words: {(bag_of_tokenTuples_from_line::token: chararray)}} Can somebody please explain how to reference the elements if we have nested tuples and bags? Any links for better understanding nested referencing would be a great help. Answer 1: Let's do a simple demonstration to understand this problem. Say a file 'a.txt' is stored at '/tmp/a.txt' in HDFS. A =
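The truncated demonstration presumably continues along these lines; a hedged reconstruction of a standard word count that produces a schema like the one above:

```pig
A = LOAD '/tmp/a.txt' AS (line:chararray);

-- TOKENIZE yields a bag of token tuples; FLATTEN turns it into rows
words = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS token;

grouped = GROUP words BY token;

-- Inside the FOREACH, the grouped bag is referenced by its alias
-- ('words') and its field with dot notation ('words.token'); the
-- '::' form appears when aliases are disambiguated after a join
counts = FOREACH grouped GENERATE group, COUNT(words.token);
```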

Is it possible to pass the value of a parameter to a UDF constructor?

南楼画角 submitted on 2019-12-11 03:39:32
Question: I've written a UDF which takes a constructor parameter. I've successfully initialized and used it in grunt as

grunt> register mylib.jar
grunt> define Function com.company.pig.udf.MyFunction('param-value');

But I can't initialize it using a Pig parameter, as in

grunt> define Decrypt com.company.pig.udf.MyFunction($secret);

or

grunt> define Decrypt com.company.pig.udf.MyFunction('$secret');

I tried to initialize $secret using both the -param and -param_file options. Are Pig parameters supported as
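Parameter substitution is a preprocessing step over a script, so one hedged approach is to move the define into a script file and pass the value with -param (the script name and parameter value below are assumptions):

```pig
-- decrypt.pig
REGISTER mylib.jar;

-- '$secret' is replaced during preprocessing, before the UDF
-- constructor runs
DEFINE Decrypt com.company.pig.udf.MyFunction('$secret');

A = LOAD 'input' AS (payload:chararray);
B = FOREACH A GENERATE Decrypt(payload);
```

Invoked as: pig -param secret=my-secret-value decrypt.pig. Inside grunt, declaring the value first with %declare may also work, but substitution support in the interactive shell is more limited than in scripts.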

Pig UDF in Java: ERROR 1070

て烟熏妆下的殇ゞ submitted on 2019-12-11 03:27:58
Question: I have created UDF_UPPER.jar in /home/GED385/pigScripts.

[GED385@snshadoope1 pigScripts]$ jar tf /home/GED385/pigScripts/UDF_UPPER.jar | grep UPPER
UPPER.class

But while executing Pig I am getting the error below.

grunt> exec digital_web_trkg_9.pig
2012-11-30 00:15:32,027 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve UDF_UPPER.UPPER using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.] Details at logfile: /data/1/GED385/pigScripts
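The jar listing shows UPPER.class in the default package (no package prefix), while the script tries to resolve UDF_UPPER.UPPER, as if the jar name were a package. A hedged fix, assuming the script registers the jar and the class really has no package declaration:

```pig
REGISTER /home/GED385/pigScripts/UDF_UPPER.jar;

-- The class lives in the default package, so reference it as UPPER,
-- not UDF_UPPER.UPPER: the jar file name is not part of the class name.
-- (The empty prefix in Pig's import list resolves it before the
-- builtin org.apache.pig.builtin.UPPER.)
A = LOAD 'input' AS (name:chararray);
B = FOREACH A GENERATE UPPER(name);
```

Declaring a real package in the Java source (and referencing the fully qualified name) would avoid shadowing the builtin UPPER entirely.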