apache-pig

ElephantBird package build failure:

不羁岁月 submitted on 2019-12-23 02:05:21
Question: I downloaded the ElephantBird source and tried to build it by running "mvn package", but I am getting the following error: [ERROR] Failed to execute goal com.github.igor-petruk.protobuf:protobuf-maven-plugin:0.4:run (default) on project elephant-bird-core: Unable to find 'protoc' -> [Help 1] I am using Maven version 3.0.3, and I tried on both Mac and Ubuntu and got the same error. EDIT1: Thanks to Lorand's comments, I resolved the above problem by upgrading Protocol Buffers. I also installed Thrift

Apache Pig: trying to get max count in each group

本秂侑毒 submitted on 2019-12-23 01:47:28
Question: I have data in Pig of the format {(group, productId, count)}. Now I want to get the maximum count in each group, and the output might look as follows: {(group, productId, maxCount)}. Here is the sample input data: (south America, prod1, 45), (south America, prod2, 36), (latin america, prod1, 48), (latin america, prod5, 35). Here is what the output for this input looks like: (south america, prod1, 45) (North America, prod2, 36) (latin america, prod1, 48). Can someone help me with this? Answer 1: Based on your sample input
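The standard idiom for this is a nested FOREACH with ORDER and LIMIT. A minimal sketch, assuming comma-separated input and the field names shown (the file path and aliases are assumptions; `group_name` is used to avoid clashing with Pig's implicit `group` alias):

```pig
-- Load the (group, productId, count) records; path and delimiter are assumptions
data = LOAD 'input' USING PigStorage(',')
       AS (group_name:chararray, productId:chararray, cnt:int);

grouped = GROUP data BY group_name;

-- For each group, sort its bag by count descending and keep the top row
max_per_group = FOREACH grouped {
    sorted = ORDER data BY cnt DESC;
    top    = LIMIT sorted 1;
    GENERATE FLATTEN(top);
};

DUMP max_per_group;
```

The nested ORDER/LIMIT runs per group, so only one tuple per group survives, carrying its productId along with the maximum count.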

Apache Pig not parsing a tuple fully

不打扰是莪最后的温柔 submitted on 2019-12-23 01:45:10
Question: I have a file called data that looks like this (note there are tabs after 'personA'): personA (1, 2, 3) personB (2, 1, 34) And I have an Apache Pig script like this: A = LOAD 'data' AS (name: chararray, nodes: tuple(a:int, b:int, c:int)); C = foreach A generate nodes.$0; dump C; The output of which makes sense: (1) (2) However, if I change the schema of the script to this: A = LOAD 'data' AS (name: chararray, nodes: tuple()); C = foreach A generate nodes.$0; dump C; Then the output
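The difference between the two scripts comes down to whether the tuple's inner schema is declared. A minimal sketch of the working form, assuming the same tab-separated layout as above (field names inside the tuple are assumptions):

```pig
-- Declaring the full inner schema lets Pig parse and type the tuple's fields
A = LOAD 'data' AS (name:chararray, nodes:tuple(a:int, b:int, c:int));

-- Fields of a typed tuple can be accessed by name or by position
C = FOREACH A GENERATE name, nodes.a;
DUMP C;
```

With a bare `tuple()` there is no inner schema, so Pig cannot type or address the tuple's fields the same way; declaring the schema explicitly avoids that.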

How to turn (A, B, C) into (AB, AC, BC) with Pig?

僤鯓⒐⒋嵵緔 submitted on 2019-12-22 17:50:54
Question: In Pig, given the following bag: (A, B, C), can I somehow calculate the unique combinations of all the values? The result I'm looking for is something like (AB, AC, BC). I'm disregarding BA, CA, CB since they would become duplicates of the existing values if sorted in alphabetical order. Answer 1: The only way of doing something like that is writing a UDF. This one will do exactly what you want: public class CombinationsUDF extends EvalFunc<DataBag> { public DataBag exec(Tuple input) throws
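Once such a UDF is compiled and packaged, invoking it from Pig is a matter of registering the jar and applying it per record. A hypothetical usage sketch; the jar name, input path, and field names are assumptions, and the bare class name works only if the UDF is in the default package:

```pig
-- Jar name is an assumption; it should contain the CombinationsUDF class above
REGISTER 'combinations-udf.jar';

-- Assumes each record carries a bag of single-field tuples of values
items = LOAD 'input' AS (values:bag{t:(value:chararray)});

-- FLATTEN turns the returned bag of pairs into one output row per pair
pairs = FOREACH items GENERATE FLATTEN(CombinationsUDF(values));
DUMP pairs;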

Writing one file per group in Pig Latin

故事扮演 submitted on 2019-12-22 14:16:29
Question: The Problem: I have numerous files containing Apache web server log entries. Those entries are not in date-time order and are scattered across the files. I am trying to use Pig to read a day's worth of files, group and order the log entries by date-time, then write them to files named for the day and hour of the entries they contain. Setup: Once I have imported my files, I am using a regex to extract the date field, then truncating it to the hour. This produces a set that has the record in one
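One common approach for writing one output per key is piggybank's MultiStorage, which partitions output by the value of a chosen field. A sketch assuming the log lines have already been reduced to a (day_hour, entry) pair; the jar path, output path, and aliases are assumptions:

```pig
-- Path to piggybank.jar is an assumption; MultiStorage lives in piggybank
REGISTER '/path/to/piggybank.jar';

logs   = LOAD 'access_logs' AS (day_hour:chararray, entry:chararray);
sorted = ORDER logs BY day_hour;

-- MultiStorage('out', '0') writes one output directory per distinct
-- value of field 0 (here, the day_hour key)
STORE sorted INTO 'out'
      USING org.apache.pig.piggybank.storage.MultiStorage('out', '0');
```

Each day-hour bucket ends up under its own subdirectory of 'out', which is as close as Pig's storage layer gets to "one file per group" without a custom StoreFunc.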

Where to see the mapreduce code generated from hadoop pig statements

自古美人都是妖i submitted on 2019-12-22 09:30:10
Question: We all know that Hadoop Pig statements are converted into Java MapReduce code. I want to know: is there any way I can see the MapReduce code generated from Pig statements? Answer 1: "We all know that Hadoop Pig statements are converted into Java MapReduce code" This is not the case. Hadoop Pig statements are not translated into Java MapReduce code. A better way of thinking about it is that Pig code is "interpreted" by a Pig interpreter that runs in Java MapReduce. Think about it this way: Python and
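While Pig never emits Java source, it can show you the plans a script compiles to. A minimal sketch using the built-in EXPLAIN operator (the script itself is a throwaway example):

```pig
-- EXPLAIN prints the logical, physical, and MapReduce execution plans
-- for an alias, which is the closest view of "generated code" Pig offers
A = LOAD 'data' AS (name:chararray, cnt:int);
B = GROUP A BY name;
C = FOREACH B GENERATE group, SUM(A.cnt);

EXPLAIN C;
```

The MapReduce plan section of the output shows how the operators are packed into map and reduce phases, even though no Java source file is ever produced.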

Type conversion pig hcatalog

一曲冷凌霜 submitted on 2019-12-22 05:25:26
Question: I use HCatalog version 0.4. I have a table in Hive, 'abc', which has a column with datatype 'timestamp'. When I try to run a Pig script like this: "raw_data = load 'abc' using org.apache.hcatalog.pig.HCatLoader();" I get an error saying "java.lang.TypeNotPresentException: Type timestamp not present". Answer 1: The problem is that HCatalog doesn't support the timestamp type. It will be supported under Hive 0.13; they have an issue about this problem that was already resolved, and you can see the issue at https:
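Until the HCatalog version in use supports timestamp, one common workaround sketch is to expose the column as a string on the Hive side (for example through a view that casts it) and convert it in Pig. The view name, column name, and date format below are all assumptions:

```pig
-- Assumes a Hive view 'abc_view' that casts the timestamp column to STRING
raw_data = LOAD 'abc_view' USING org.apache.hcatalog.pig.HCatLoader();

-- Convert the string back to a datetime inside Pig; format is an assumption
typed = FOREACH raw_data
        GENERATE ToDate(ts_as_string, 'yyyy-MM-dd HH:mm:ss') AS ts;
```

This keeps HCatLoader away from the unsupported type while still letting the Pig side work with real datetimes.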

Why does this Pig UDF Result in an “Error: Java heap space” Given that I am Spilling the DataBag to Disk?

…衆ロ難τιáo~ submitted on 2019-12-21 22:52:46
Question: Here is my UDF: public DataBag exec(Tuple input) throws IOException { Aggregate aggregatedOutput = null; int spillCount = 0; DataBag outputBag = BagFactory.newDefaultBag(); DataBag values = (DataBag)input.get(0); for (Iterator<Tuple> iterator = values.iterator(); iterator.hasNext();) { Tuple tuple = iterator.next(); //spillCount++; ... if (some condition regarding current input tuple){ //do something to aggregatedOutput with information from input tuple } else { //Because input tuple does not

Equivalent of linux 'diff' in Apache Pig

江枫思渺然 submitted on 2019-12-21 20:39:50
Question: I want to be able to do a standard diff on two large files. I've got something that works, but it's not nearly as quick as diff on the command line. A = load 'A' as (line); B = load 'B' as (line); JOINED = join A by line full outer, B by line; DIFF = FILTER JOINED by A::line is null or B::line is null; DIFF2 = FOREACH DIFF GENERATE (A::line is null ? B::line : A::line), (A::line is null ? 'REMOVED' : 'ADDED'); STORE DIFF2 into 'diff'; Has anyone got any better ways to do this? Answer 1: I use the
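An alternative sketch uses COGROUP instead of a full outer join, which avoids materializing joined rows and filters on empty bags directly (aliases mirror the script above; the output path is an assumption):

```pig
A = LOAD 'A' AS (line:chararray);
B = LOAD 'B' AS (line:chararray);

-- COGROUP collects, per distinct line, the matching tuples from each input
G = COGROUP A BY line, B BY line;

-- A line differs when it appears in only one of the two inputs
DIFF = FILTER G BY IsEmpty(A) OR IsEmpty(B);

-- Empty A bag means the line exists only in B (REMOVED), and vice versa,
-- matching the labeling convention of the original script
OUT = FOREACH DIFF GENERATE group AS line,
                            (IsEmpty(A) ? 'REMOVED' : 'ADDED');
STORE OUT INTO 'diff';
```

The semantics match the join-based version, but grouping by line keeps only the key and small per-key bags in flight rather than full joined tuples.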