apache-pig

I have 50 fields. Is there any option in Apache Pig to print the first 40 fields, something like the range $0-$39?

99封情书 submitted on 2019-12-11 07:08:56
Question: I have 50 fields. Is there any option in Pig to print the first 40 fields? I require something like the range $0-$39. I don't want to specify each and every field like $0, $1, $2, etc. Spelling out every column is acceptable when the number of columns is small, but what do you do when there is a huge number of columns?

Answer 1: You can use the .. notation.

First 40 fields: B = FOREACH A GENERATE $0..$39;
All fields: B = FOREACH A GENERATE $0..;
Multiple ranges, for example first 10, 15-20, 25-50: B = FOREACH A
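The truncated multiple-range example above can be sketched as follows. The relation names and input file are assumptions; note that Pig positional references are 0-based, so "fields 15-20" becomes $14..$19:

```pig
-- Load a 50-field relation (schema omitted; positional references are 0-based)
A = LOAD 'data.tsv' USING PigStorage('\t');

-- First 40 fields
B = FOREACH A GENERATE $0..$39;

-- All fields from the first onward
C = FOREACH A GENERATE $0..;

-- Multiple ranges: first 10, fields 15-20, fields 25-50 (1-based counting)
D = FOREACH A GENERATE $0..$9, $14..$19, $24..$49;
```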

Can't connect to Bigtable to scan HTable data due to hardcoded managed=true in hbase client jars

时光毁灭记忆、已成空白 submitted on 2019-12-11 06:37:43
Question: I'm working on a custom load function to load data from Bigtable using Pig on Dataproc. I compile my Java code using the following list of jar files I grabbed from Dataproc. When I run the following Pig script, it fails when it tries to establish a connection with Bigtable. The error message is: Bigtable does not support managed connections. Questions: Is there a workaround for this problem? Is this a known issue, and is there a plan to fix or adjust it? Is there a different way of implementing
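One possible workaround, sketched below, is to obtain the connection through the Bigtable HBase adapter rather than HBase's own managed ConnectionFactory path. The class com.google.cloud.bigtable.hbase.BigtableConfiguration and its connect(...) signature are assumptions about the bigtable-hbase client version on the cluster, so verify them before relying on this:

```java
import org.apache.hadoop.hbase.client.Connection;
import com.google.cloud.bigtable.hbase.BigtableConfiguration;

public class BigtableConnect {
    public static Connection open() {
        // BigtableConfiguration.connect() returns an unmanaged connection,
        // sidestepping the managed=true behavior hardcoded in the HBase
        // client jars (the project and instance ids are placeholders)
        return BigtableConfiguration.connect("my-project", "my-instance");
    }
}
```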

Java UDF Date Regex Extractor for Pig?

岁酱吖の submitted on 2019-12-11 06:14:56
Question: I am trying to create a UDF for importing into Pig that matches a regex pattern on a date. The regex has been tested and works accordingly, but I am having trouble with the following code: package com.date.format; import java.io.IOException; import java.util.regex.Matcher; import java.util.regex.Pattern; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; public class DATERANGE extends EvalFunc<String> { @Override public String exec(Tuple arg0) throws IOException { try { String
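A complete version of such an EvalFunc might look like the sketch below. The date pattern and the null-handling are assumptions, since the original code is truncated before the regex appears:

```java
package com.date.format;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class DATERANGE extends EvalFunc<String> {
    // Example pattern (an assumption): dates like 21-Oct-2013
    private static final Pattern DATE =
            Pattern.compile("\\d{2}-[A-Za-z]{3}-\\d{4}");

    @Override
    public String exec(Tuple arg0) throws IOException {
        // Guard against empty tuples and null fields
        if (arg0 == null || arg0.size() == 0 || arg0.get(0) == null) {
            return null;
        }
        try {
            String line = (String) arg0.get(0);
            Matcher m = DATE.matcher(line);
            return m.find() ? m.group() : null;
        } catch (Exception e) {
            throw new IOException("Error processing input row", e);
        }
    }
}
```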

Selecting random tuple from bag

依然范特西╮ submitted on 2019-12-11 05:55:35
Question: Is it possible to (efficiently) select a random tuple from a bag in Pig? I can just take the first result of a bag (as it is unordered), but in my case I need a proper random selection. One (inefficient) solution is to count the number of tuples in the bag, pick a random number within that range, loop through the bag, and stop whenever the number of iterations matches my random number. Does anyone know of a faster/better way to do this? Answer 1: You could use RANDOM(), ORDER and LIMIT in a
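The RANDOM()/ORDER/LIMIT idea mentioned in the truncated answer can be sketched as a nested FOREACH; the relation names and schema are assumptions:

```pig
A = LOAD 'data.tsv' AS (key:chararray, val:int);
G = GROUP A BY key;

-- Attach a random sort key to every tuple in the bag, sort by it,
-- and keep only the first tuple: a uniform random pick per group
R = FOREACH G {
    withRand = FOREACH A GENERATE RANDOM() AS r, *;
    sorted   = ORDER withRand BY r;
    picked   = LIMIT sorted 1;
    GENERATE group, FLATTEN(picked);
};
```

The sort costs O(n log n) per bag, but it avoids a second pass over the data and stays inside a single FOREACH.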

Error using CSVLoader from piggybank

无人久伴 submitted on 2019-12-11 05:04:46
Question: I am trying to use CSVLoader from Piggybank. Below are the first two lines of my code:

register 'piggybank.jar';
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();

It throws the following error:

2013-10-24 14:26:51,427 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2013-10-24 14:26:52,029 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve org.apache.pig.piggybank.storage
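ERROR 1070 here usually means the jar was never actually loaded, for example because 'piggybank.jar' is not in the directory Pig was started from. A minimal sketch using an absolute path (the jar location and input file are assumptions):

```pig
-- Register piggybank with an absolute path so Pig can resolve the class
REGISTER /usr/lib/pig/piggybank.jar;

DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();

-- input.csv is a hypothetical comma-separated file
A = LOAD 'input.csv' USING CSVLoader() AS (id:int, name:chararray);
DUMP A;
```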

Join table by string matching in Hive or Impala or Pig

我的未来我决定 submitted on 2019-12-11 04:36:42
Question: I have two tables A and B, where B is huge (20 million rows by 300 columns) and A is of moderate size (300k rows by 10 columns). A contains one column that is an address, and B contains 3 columns that can be put together to form a proper street address. For example, in A the address column could be:

id  | Address
-------------------
233 | 123 Main St

and in B we could have:

Number | Street_name | Street_suffix | Tax
------------------------------------------------
123    | Main        | Street        | 320.2

I want to join them using string
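One hedged sketch in Pig is to build a normalized join key on both sides and join on it. CONCAT, LOWER, and REPLACE are Pig built-ins; the specific suffix normalization shown (mapping 'Street' to 'St') is an assumption about the data:

```pig
A = LOAD 'a.tsv' AS (id:int, address:chararray);
B = LOAD 'b.tsv' AS (number:chararray, street_name:chararray,
                     street_suffix:chararray, tax:double);

-- Build a comparable key on each side; REPLACE normalizes 'Street' -> 'St'
A2 = FOREACH A GENERATE id, LOWER(address) AS addr_key;
B2 = FOREACH B GENERATE tax,
     LOWER(CONCAT(number, CONCAT(' ', CONCAT(street_name, CONCAT(' ',
           REPLACE(street_suffix, 'Street', 'St')))))) AS addr_key;

-- A is the small side, so a replicated (map-side) join avoids shuffling B
J = JOIN B2 BY addr_key, A2 BY addr_key USING 'replicated';
```

Exact-match joins on concatenated keys only work if the normalization really makes the strings identical; fuzzy matching would need a UDF.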

how to cluster users based on tags

穿精又带淫゛_ submitted on 2019-12-11 04:11:51
Question: I'd like to cluster users based on the categories or tags of the shows they watch. What's the easiest/best algorithm to do this? Assuming I have around 20,000 tags and several million watch events I can use as signals, is there an algorithm I can implement using, say, Pig/Hadoop/Mortar, or perhaps on Neo4j? In terms of data, I have users, the programs they've watched, and the tags that a program has (usually around 10 tags per program). At the end I would expect k clusters (maybe a
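Whatever clustering algorithm is chosen, the watch events first have to be turned into per-user tag counts. A minimal Pig sketch of that feature-building step (file names and schemas are assumptions):

```pig
watches = LOAD 'watch_events.tsv' AS (user_id:chararray, program_id:chararray);
tags    = LOAD 'program_tags.tsv' AS (program_id:chararray, tag:chararray);

-- Expand each watch event into one row per tag of the watched program
J = JOIN watches BY program_id, tags BY program_id;
user_tags = FOREACH J GENERATE watches::user_id AS user_id, tags::tag AS tag;

-- Count how often each user saw each tag; these sparse counts are the
-- feature vectors a clustering algorithm such as k-means would consume
G = GROUP user_tags BY (user_id, tag);
features = FOREACH G GENERATE FLATTEN(group) AS (user_id, tag),
                              COUNT(user_tags) AS cnt;
```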

Pig referencing

隐身守侯 submitted on 2019-12-11 03:50:48
Question: I am learning Hadoop Pig and I always get stuck at referencing the elements. Please see the example below: groupwordcount: {group: chararray, words: {(bag_of_tokenTuples_from_line::token: chararray)}} Can somebody please explain how to reference the elements if we have nested tuples and bags? Any links for better understanding nested referencing would be a great help. Answer 1: Let's do a simple demonstration to understand this problem. Say a file 'a.txt' is stored at '/tmp/a.txt' in HDFS. A =
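The truncated demonstration presumably continues along these lines; a hedged reconstruction of a standard word count that produces a schema like the one above:

```pig
A = LOAD '/tmp/a.txt' AS (line:chararray);

-- TOKENIZE yields a bag of token tuples; FLATTEN turns it into rows
words = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS token;

grouped = GROUP words BY token;

-- Inside the FOREACH, the grouped bag is referenced by its alias
-- ('words') and its field with dot notation ('words.token'); the
-- '::' form appears when aliases are disambiguated after a join
counts = FOREACH grouped GENERATE group, COUNT(words.token);
```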

Is it possible to pass the value of a parameter to a UDF constructor?

南楼画角 submitted on 2019-12-11 03:39:32
Question: I've written a UDF which takes a constructor parameter. I've successfully initialized and used it in grunt as

grunt> register mylib.jar
grunt> define Function com.company.pig.udf.MyFunction('param-value');

But I can't initialize it using a Pig parameter, as in

grunt> define Decrypt com.company.pig.udf.MyFunction($secret);

or

grunt> define Decrypt com.company.pig.udf.MyFunction('$secret');

I tried to initialize $secret using both the -param and -param_file options. Are Pig parameters supported as
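Parameter substitution is a preprocessing step over a script, so one hedged approach is to move the define into a script file and pass the value with -param (the script name and parameter value below are assumptions):

```pig
-- decrypt.pig
REGISTER mylib.jar;

-- '$secret' is replaced during preprocessing, before the UDF
-- constructor runs
DEFINE Decrypt com.company.pig.udf.MyFunction('$secret');

A = LOAD 'input' AS (payload:chararray);
B = FOREACH A GENERATE Decrypt(payload);
```

Invoked as: pig -param secret=my-secret-value decrypt.pig. Inside grunt, declaring the value first with %declare may also work, but substitution support in the interactive shell is more limited than in scripts.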

Pig UDF in Java: ERROR 1070

て烟熏妆下的殇ゞ submitted on 2019-12-11 03:27:58
Question: I have created UDF_UPPER.jar in /home/GED385/pigScripts.

[GED385@snshadoope1 pigScripts]$ jar tf /home/GED385/pigScripts/UDF_UPPER.jar | grep UPPER
UPPER.class

But while executing Pig I am getting the error below.

grunt> exec digital_web_trkg_9.pig
2012-11-30 00:15:32,027 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve UDF_UPPER.UPPER using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.] Details at logfile: /data/1/GED385/pigScripts
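The jar listing shows UPPER.class in the default package (no package prefix), while the script tries to resolve UDF_UPPER.UPPER, as if the jar name were a package. A hedged fix, assuming the script registers the jar and the class really has no package declaration:

```pig
REGISTER /home/GED385/pigScripts/UDF_UPPER.jar;

-- The class lives in the default package, so reference it as UPPER,
-- not UDF_UPPER.UPPER: the jar file name is not part of the class name.
-- (The empty prefix in Pig's import list resolves it before the
-- builtin org.apache.pig.builtin.UPPER.)
A = LOAD 'input' AS (name:chararray);
B = FOREACH A GENERATE UPPER(name);
```

Declaring a real package in the Java source (and referencing the fully qualified name) would avoid shadowing the builtin UPPER entirely.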