apache-pig | 易学教程

Hadoop Pig Ordered Analytical Functions

Question: I am new to Pig and would like to use an ordered analytical function, similar to what is possible in SQL. My data looks something like this:

(stock_symbol,date,stock_price_open,stock_price_close)
(TAC,2001-08-06,16.39,16.36)
(TAC,2001-08-07,16.3,16.54)
(TAC,2001-08-08,16.55,16.44)
(TAC,2001-08-09,16.45,16.48)
(TAC,2001-08-10,16.5,15.8)

What I want to do is find the change in the opening stock price from day to day. So, my output would look something like this:

(stock_symbol,date,stock_price_open
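One way to picture the result being asked for is to pair each row with the previous row for the same symbol, ordered by date, and subtract the opening prices. The sketch below does this in plain Python rather than Pig Latin, using the sample rows from the question. The function name `daily_open_change` and the choice to report `None` for the first day are assumptions made for illustration; they are not part of the original post.

```python
from decimal import Decimal
from itertools import groupby
from operator import itemgetter

# Sample rows from the question:
# (stock_symbol, date, stock_price_open, stock_price_close)
rows = [
    ("TAC", "2001-08-06", "16.39", "16.36"),
    ("TAC", "2001-08-07", "16.3", "16.54"),
    ("TAC", "2001-08-08", "16.55", "16.44"),
    ("TAC", "2001-08-09", "16.45", "16.48"),
    ("TAC", "2001-08-10", "16.5", "15.8"),
]


def daily_open_change(rows):
    """Return (symbol, date, open, change) tuples.

    `change` is the difference from the previous row's opening price for
    the same symbol, or None for the first row (an assumption).
    """
    result = []
    # ISO dates sort correctly as strings, so sort by (symbol, date).
    ordered = sorted(rows, key=itemgetter(0, 1))
    for symbol, group in groupby(ordered, key=itemgetter(0)):
        previous_open = None
        for _, date, open_price, _ in group:
            # Decimal avoids binary floating-point noise in price differences.
            current_open = Decimal(open_price)
            change = None if previous_open is None else current_open - previous_open
            result.append((symbol, date, current_open, change))
            previous_open = current_open
    return result


changes = daily_open_change(rows)
for row in changes:
    print(row)
```

The same "compare each row with its predecessor inside a sorted group" idea is what an ordered analytical function such as SQL's `LAG` expresses.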

Pig xmlloader error when loading tag with colon

Question: I've been using Pig and XMLLoader to load XML files. I've been practising on the BOOK example. However, the XML file I need to process has colons in its tags. When I run a script, it says that the file cannot be processed because of the ':' (exact log at the end). This is the file I have, modified for the purpose of the ':' case.

BOOKT.xml

<CATALOG>
<BC:BOOK id="1">
<TITLE>Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BC

Error 1045 on sum function in pig latin with an int

Question: The following Pig Latin script:

data = load 'access_log_Jul95' using PigStorage(' ') as (ip:chararray, dash1:chararray, dash2:chararray, date:chararray, date1:chararray, getRequset:chararray, location:chararray, http:chararray, code:int, size:int);
splitDate = foreach data generate size as size:int , ip as ip, FLATTEN(STRSPLIT(date, ':')) as h;
groupedIp = group splitDate by h.$1;
a = foreach groupedIp{
    added = foreach splitDate generate SUM(size);
    -- generate added;
};
describe a;

gives me

Can't load avro schema in pig

Question: I have an Avro schema, and I am writing data with that schema to an AvroSequenceFileOutputFormat. I looked in the file and can confirm that the schema is there to read. I call the function

avro = load 'part-r-00000.avro' using AvroStorage();

and it gives me the error message

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema from loadFunc org.apache.pig.builtin.AvroStorage
Details at logfile: /Users/ajosephs/Code/serialization-protocol/output/pig_1391635368675.log

Does

Combining Multiple Maps together in Pig

Question: I am using Pig for the first time. I've gotten to the point where I have exactly the answer I want, but in a weirdly nested format:

{(price,49),(manages,"1d74426f-2b0a-4777-ac1b-042268cab09c")}

I'd like the output to be a single map, without any wrapping:

[price#49, manages#"1d74426f-2b0a-4777-ac1b-042268cab09c"]

I've managed to use TOMAP to get this far, but I can't figure out how to merge and flatten it away.

{([price_specification#{"amount":49,"currency":"USD"}]),([manages#"newest-nodes
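The reshaping described here, from a bag of key/value tuples to a single flat map, amounts to folding the pairs into one dictionary. As an illustration only, here is that fold in plain Python rather than Pig Latin. `merge_pairs` is a hypothetical helper name, and the bag literal is transcribed from the first example in the question.

```python
def merge_pairs(bag):
    """Fold an iterable of (key, value) tuples into a single dict.

    If a key repeats, the later value wins. That rule is an assumption;
    the question does not say how duplicate keys should be handled.
    """
    merged = {}
    for key, value in bag:
        merged[key] = value
    return merged


# The nested bag from the question, written as a list of tuples.
bag = [("price", 49), ("manages", "1d74426f-2b0a-4777-ac1b-042268cab09c")]

merged = merge_pairs(bag)
print(merged)
```

In Pig terms this corresponds to turning a bag of tuples into one map value, which is typically done with a user-defined function when no built-in fits.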

pig udf to calculate time difference in weblogs

Question: Is there a Pig UDF that calculates time differences in weblogs? Assuming I have weblogs in the format below:

10.171.100.10 - - [12/Jan/2012:14:39:46 +0530] "GET /amazon/navigator/index.php HTTP/1.1" 200 402 "someurl/page1" "Mozilla/4.0 ( compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506 .2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:41:47 +0530] "GET /amazon/header.php HTTP/1.1 " 200 4376 "someurl/page2"
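The core of such a calculation is parsing the bracketed timestamp of each log line and subtracting consecutive values. The following is a minimal Python sketch of that step, not a Pig UDF. The helper name `parse_log_time` is hypothetical, and the two sample lines are shortened copies of the ones in the question (the referrer and user-agent fields are dropped).

```python
from datetime import datetime

# Layout of the timestamp inside the square brackets of the sample lines.
# %b matches English month abbreviations under the default C/English locale.
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_log_time(line):
    """Extract and parse the [dd/Mon/yyyy:HH:MM:SS +zzzz] field of a log line."""
    start = line.index("[") + 1
    end = line.index("]", start)
    return datetime.strptime(line[start:end], LOG_TIME_FORMAT)


first = '10.171.100.10 - - [12/Jan/2012:14:39:46 +0530] "GET /amazon/navigator/index.php HTTP/1.1" 200 402'
second = '10.171.100.10 - - [12/Jan/2012:14:41:47 +0530] "GET /amazon/header.php HTTP/1.1 " 200 4376'

# Subtracting two timezone-aware datetimes yields a timedelta.
elapsed = parse_log_time(second) - parse_log_time(first)
print(elapsed.total_seconds())
```

Because both timestamps carry their UTC offset, the subtraction stays correct even if consecutive lines were logged with different offsets.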

CDH4 - Exception: java.lang.IncompatibleClassChangeError:

Question: I am getting a Java error when I launch a Pig script. It appears to be a dependency or version conflict. I am running Debian / Cloudera CDH4 / Apache Pig.

java.lang.Exception: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected

Answer 1: The

Hadoop pig XPath returning empty attribute value

Question: I am using Cloudera Hadoop 2.6 and Pig 0.15. I am trying to extract data from an XML file. Below you can see part of the XML file.

<product productID="MICROLITEMX1600LAMP">
  <basicInfo>
    <category lang="NL" id="OT1006">Output Accessoires</category>
  </basicInfo>
</product>

I can dump node values but not attribute values using the XPath() function. The code below returns empty tuples instead of the productID.

DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath(); allProducts

Pig UDF running on AWS EMR with java.lang.NoClassDefFoundError: org/apache/pig/LoadFunc

Question: I am developing an application that tries to read log files stored in S3 buckets and parse them using Elastic MapReduce. Currently the log file has the following format:

-------------------------------
COLOR=Black
Date=1349719200
PID=23898
Program=Java
EOE
-------------------------------
COLOR=White
Date=1349719234
PID=23828
Program=Python
EOE

So I tried to load the file into my Pig script, but the built-in Pig loaders don't seem to be able to load my data, so I have to create my own UDF. Since I am pretty
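Independently of how the loader is packaged, the parsing itself is small: collect `KEY=value` fields until an `EOE` marker closes the record. Here is a hedged Python sketch of that logic, not a Pig `LoadFunc`. The name `parse_records` is hypothetical, and the code assumes that fields are whitespace-separated and that values contain no whitespace, which holds for the sample in the question but may not hold for real logs.

```python
def parse_records(text):
    """Split text into EOE-terminated records.

    Each record is returned as a dict of its KEY=value fields.
    Assumes values contain no whitespace (true for the sample data).
    """
    records = []
    current = {}
    for token in text.split():
        if token == "EOE":
            # End-of-entry marker: close the current record.
            records.append(current)
            current = {}
        elif "=" in token:
            key, value = token.split("=", 1)
            current[key] = value
        # Any other token, such as the dashed separator, is ignored.
    return records


sample = """
-------------------------------
COLOR=Black
Date=1349719200
PID=23898
Program=Java
EOE
-------------------------------
COLOR=White
Date=1349719234
PID=23828
Program=Python
EOE
"""

records = parse_records(sample)
for record in records:
    print(record)
```

A custom loader would apply the same record-boundary rule while reading its input split, emitting one tuple per `EOE`-terminated block.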

how to save pig bag in json format

Question: I'm running Pig

example$ pig --version
Apache Pig version 0.8.1-cdh3u1 (rexported)
compiled Jul 18 2011, 08:29:40

on a very simple dataset

example$ hadoop fs -cat /user/pavel/trivial.log
1 one
2 two
3 three

I'm trying to save the bag format as JSON by using the following script:

REGISTER ./pig.jar;
A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray);
B = GROUP A BY mynum;
DUMP B;
STORE B into 'trivial_json.out' USING JsonStorage();

and I get an error:

Backend