apache-pig

Hadoop Pig Ordered Analytical Functions

我的梦境 posted on 2019-12-11 19:23:29
Question: I am new to Pig and would like to use an ordered analytical function, similar to what is possible in SQL. My data looks something like this: (stock_symbol,date,stock_price_open,stock_price_close) (TAC,2001-08-06,16.39,16.36) (TAC,2001-08-07,16.3,16.54) (TAC,2001-08-08,16.55,16.44) (TAC,2001-08-09,16.45,16.48) (TAC,2001-08-10,16.5,15.8) What I want to do is find the change in opening stock price from day to day. So, my output would look something like this: (stock_symbol,date,stock_price_open

Pig xmlloader error when loading tag with colon

烈酒焚心 posted on 2019-12-11 19:22:54
Question: I've been using Pig and XMLLoader to load XML files, practising on the BOOK example. However, the XML file I need to process has colons in its tag names. When I run a script, it reports that the file cannot be processed because of the ':' (exact log at the end). This is the file I have, modified to reproduce the ':' case. BOOKT.xml <CATALOG> <BC:BOOK id="1"> <TITLE>Hadoop Defnitive Guide</TITLE> <AUTHOR>Tom White</AUTHOR> <COUNTRY>US</COUNTRY> <COMPANY>CLOUDERA</COMPANY> <PRICE>24.90</PRICE> <YEAR>2012</YEAR> </BC

Error 1045 on sum function in pig latin with an int

纵饮孤独 posted on 2019-12-11 19:18:56
Question: The following Pig Latin script: data = load 'access_log_Jul95' using PigStorage(' ') as (ip:chararray, dash1:chararray, dash2:chararray, date:chararray, date1:chararray, getRequset:chararray, location:chararray, http:chararray, code:int, size:int); splitDate = foreach data generate size as size:int , ip as ip, FLATTEN(STRSPLIT(date, ':')) as h; groupedIp = group splitDate by h.$1; a = foreach groupedIp{ added = foreach splitDate generate SUM(size); -- generate added; }; describe a; gives me

Can't load avro schema in pig

末鹿安然 posted on 2019-12-11 19:08:15
Question: I have an Avro schema, and I am writing data with that schema to an AvroSequenceFileOutputFormat. I looked in the file and can confirm that the schema is there to read. I call the function avro = load 'part-r-00000.avro' using AvroStorage(); and it gives me the error message ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema from loadFunc org.apache.pig.builtin.AvroStorage Details at logfile: /Users/ajosephs/Code/serialization-protocol/output/pig_1391635368675.log Does

Combining Multiple Maps together in Pig

馋奶兔 posted on 2019-12-11 18:35:00
Question: I am using Pig for the first time. I've gotten to the point where I have exactly the answer I want, but in a weirdly nested format: {(price,49),(manages,"1d74426f-2b0a-4777-ac1b-042268cab09c")} I'd like the output to be a single map, without any wrapping: [price#49, manages#"1d74426f-2b0a-4777-ac1b-042268cab09c"] I've managed to use TOMAP to get this far, but I can't figure out how to merge the maps and flatten the wrapping away. {([price_specification#{"amount":49,"currency":"USD"}]),([manages#"newest-nodes

pig udf to calculate time difference in weblogs

▼魔方 西西 posted on 2019-12-11 18:27:12
Question: Is there a Pig UDF that calculates time differences in weblogs? Assuming I have weblogs in the format below: 10.171.100.10 - - [12/Jan/2012:14:39:46 +0530] "GET /amazon/navigator/index.php HTTP/1.1" 200 402 "someurl/page1" "Mozilla/4.0 ( compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506 .2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)" 10.171.100.10 - - [12/Jan/2012:14:41:47 +0530] "GET /amazon/header.php HTTP/1.1 " 200 4376 "someurl/page2"

CDH4 - Exception: java.lang.IncompatibleClassChangeError:

浪尽此生 posted on 2019-12-11 18:09:30
Question: I am getting a Java error when I launch a Pig script; it appears to be a dependency or version conflict. I am running Debian / Cloudera CDH4 / Apache Pig. java.lang.Exception: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406) Caused by: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected Answer 1: The

Hadoop pig XPath returning empty attribute value

回眸只為那壹抹淺笑 posted on 2019-12-11 16:50:48
Question: I am using Cloudera Hadoop 2.6 and Pig 0.15. I am trying to extract data from an XML file. Below is part of the XML file. <product productID="MICROLITEMX1600LAMP"> <basicInfo> <category lang="NL" id="OT1006">Output Accessoires</category> </basicInfo> </product> I can dump node values but not attribute values using the XPath() function. The code below returns empty tuples instead of the productID. DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath(); allProducts

Pig UDF running on AWS EMR with java.lang.NoClassDefFoundError: org/apache/pig/LoadFunc

人走茶凉 posted on 2019-12-11 16:47:47
Question: I am developing an application that tries to read log files stored in S3 buckets and parse them using Elastic MapReduce. Currently the log file has the following format: ------------------------------- COLOR=Black Date=1349719200 PID=23898 Program=Java EOE ------------------------------- COLOR=White Date=1349719234 PID=23828 Program=Python EOE I tried to load the file in my Pig script, but the built-in Pig loader doesn't seem to be able to load my data, so I have to create my own UDF. Since I am pretty

how to save pig bag in json format

人走茶凉 posted on 2019-12-11 16:44:45
Question: I'm running Pig example$ pig --version Apache Pig version 0.8.1-cdh3u1 (rexported) compiled Jul 18 2011, 08:29:40 on a very simple dataset example$ hadoop fs -cat /user/pavel/trivial.log 1 one 2 two 3 three I'm trying to save the bag as JSON by using the following script: REGISTER ./pig.jar; A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray); B = GROUP A BY mynum; DUMP B; STORE B into 'trivial_json.out' USING JsonStorage(); and I get an error: Backend