apache-pig | 易学教程

pig beginner's example [unexpected error]

Question: I am new to Linux and Apache Pig. I am following this tutorial to learn Pig: http://salsahpc.indiana.edu/ScienceCloud/pig_word_count_tutorial.htm This is a basic word-counting example. The data file 'input.txt' and the program file 'wordcount.pig' are in the Wordcount package, linked on the site. I already have Pig 0.11.1 downloaded on my local machine, as well as Hadoop and Java 6. When I downloaded the Wordcount package it took me to a "tar.gz" file. I am unfamiliar with this type, and

Hive not detecting timestamp format

Question: I have a Pig script that:
- Loads and transforms the data from a CSV
- Replaces some characters
- Calls a Java program (JAR) to convert the date-time in the CSV from 06/02/2015 18:52 to 2015-6-2 18:52 (mm/DD/yyyy to yyyy-MM-dd)
REGISTER /home/cloudera/DateTime.jar;
A = Load '/user/cloudera/Data.csv' using PigStorage(',') as (ac,datetime,amt,trace);
B = FOREACH A GENERATE ac, REPLACE(datetime, '\\/','-') as newdate,REPLACE(amt,'-','') as newamt,trace;
C = FOREACH B GENERATE ac,Converter.DateTime(newdate)
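
The conversion described in the question can be sketched outside Pig, in Python, to make the intended input and output concrete. This is only an illustration, not the asker's `Converter.DateTime` code: the function name `convert` is hypothetical, and the zero-padded output with a seconds field is an assumption about what the target column expects.

```python
from datetime import datetime


def convert(value: str) -> str:
    """Convert 'MM/dd/yyyy HH:mm' text to zero-padded 'yyyy-MM-dd HH:mm:ss'."""
    # Month comes first in the input, as the question states.
    parsed = datetime.strptime(value, "%m/%d/%Y %H:%M")
    # Assumption: the consumer wants zero-padded fields and a seconds part.
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


print(convert("06/02/2015 18:52"))  # 2015-06-02 18:52:00
```

Note that this output differs from the `2015-6-2 18:52` form quoted in the question, which has no zero padding and no seconds.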

Replace character in pig

Question: My data is in the following format:
{"Foo":"ABC","Bar":"20090101100000","Quux":"{\"QuuxId\":1234,\"QuuxName\":\"Sam\"}"}
I need it to be in this format:
{"Foo":"ABC","Bar":"20090101100000","Quux":{"QuuxId":1234,"QuuxName":"Sam"}}
I'm trying to use Pig's REPLACE function to get it into the format I need. So, I tried:
"LOGS = LOAD 'inputloc' USING TextStorage() as unparsedString:chararray;;" +
"REPL1 = foreach LOGS REPLACE($0, '"{', '{');" +
"REPL2 = foreach REPL1 REPLACE($0, '}"', '}');"
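
As a point of comparison with the string-replacement attempt above, here is a minimal Python sketch that reaches the target format by parsing rather than replacing. It is not a Pig solution; the helper name `unnest` is hypothetical, and it assumes each record is one valid JSON object whose `Quux` value is itself a JSON-encoded string.

```python
import json


def unnest(line: str, field: str = "Quux") -> str:
    """Parse the outer record, then parse the JSON string stored in `field`."""
    record = json.loads(line)
    # The nested value arrives as a string; decoding it removes the
    # surrounding quotes and the backslash escapes in one step.
    record[field] = json.loads(record[field])
    # Compact separators reproduce the spacing used in the question.
    return json.dumps(record, separators=(",", ":"))


raw = r'{"Foo":"ABC","Bar":"20090101100000","Quux":"{\"QuuxId\":1234,\"QuuxName\":\"Sam\"}"}'
print(unnest(raw))
```

Parsing avoids a pitfall of the replace approach: removing only the `"{` and `}"` pairs would still leave the escaped `\"` sequences inside the nested object.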

Loading files into pig and decompressing them

Question: I am loading a bunch of files from Azure storage into Pig. Pig has default support for gzip, so if the file extensions are .gz everything works fine. The problem is that older files are stored with a .zip extension (I have millions of those). Is there a way to tell Pig to load files and treat .zip as gzip?
Answer 1: I don't know whether other options are available, but you can try something like this:
- Write a bash script that converts the given zip file to a gz file
- Load the gz file in Pig
Just a
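
The conversion step suggested in the answer can be sketched as follows. The answer proposes a bash script; this sketch uses Python's standard library instead, purely for illustration. The function name `zip_to_gz` is hypothetical, and it assumes each .zip archive holds exactly one file, which the question does not confirm.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def zip_to_gz(zip_path: Path, gz_path: Path) -> None:
    """Recompress the single member of a .zip archive as a .gz file."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        # Assumption: one data file per archive; anything else is rejected.
        if len(names) != 1:
            raise ValueError(f"expected exactly one member, found {len(names)}")
        # Stream the bytes so large files are never held fully in memory.
        with archive.open(names[0]) as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)


# Usage (hypothetical paths):
# zip_to_gz(Path("logs/2014-01-01.zip"), Path("logs/2014-01-01.gz"))
```

With millions of files, a conversion like this would have to run as a batch job before the Pig load.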

Pig Performance Measurement

Question: I wrote a Pig script and want to execute it on a Hadoop cluster. How can I measure the total processing time? Is there a command that reports the processing time from start to end?
Answer 1: EDIT: Added the time alternative. To know how long it takes (in seconds):
    time pig <options>
Another way to do it:
    d1=$(date +%s)
    pig <options>
    d2=$(date +%s)
    echo "$d2 - $d1" | bc
Or, in a single line:
    d1=$(date +%s) ; pig <options> ; d2=$(date +%s) ; echo "$d2 - $d1" | bc
You can also just take a look

Not able to split chararray field containing spaces and tabs between the words. Help me with the command using Apache Pig?

Question: Sample.txt file:
2017-01-01 10:21:59 THURSDAY -39 3 Pick up a bus - Travel for two hours
2017-02-01 12:45:19 FRIDAY -55 8 Pick up a train - Travel for one hour
2017-03-01 11:35:49 SUNDAY -55 8 Pick up a train - Travel for one hour
I . . When I executed the suggested command, it got split into three fields. When I do the operation below, it does not work as expected.
A = LOAD 'Sample.txt' USING PigStorage() as (line:chararray);
B = foreach A generate STRSPLIT(line, ' ', 3);
c = foreach B

accessing an element like array in pig

Question: I have data in the form id,val1,val2. Example:
1,0.2,0.1
1,0.1,0.7
1,0.2,0.3
2,0.7,0.9
2,0.2,0.3
2,0.4,0.5
First I want to sort the rows of each id by val1 in decreasing order, giving something like:
1,0.2,0.1
1,0.2,0.3
1,0.1,0.7
2,0.7,0.9
2,0.4,0.5
2,0.2,0.3
Then I want to select the id,val2 combination of the second element for each id. For example:
1,0.3
2,0.5
How do I approach this? Thanks
Answer 1: Pig is a scripting language, not a relational one like SQL; it is well suited to work with groups with operators nested
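
The group, sort, and pick-second logic the question asks for can be sketched in Python to pin down the expected result. This is not the Pig answer, which is truncated above; the function name `second_by_val1` is hypothetical, and ids with fewer than two rows are skipped as an assumption.

```python
from collections import defaultdict

rows = [
    (1, 0.2, 0.1), (1, 0.1, 0.7), (1, 0.2, 0.3),
    (2, 0.7, 0.9), (2, 0.2, 0.3), (2, 0.4, 0.5),
]


def second_by_val1(rows):
    """For each id, sort by val1 descending and return (id, val2) of the 2nd row."""
    groups = defaultdict(list)
    for id_, val1, val2 in rows:
        groups[id_].append((val1, val2))
    result = []
    for id_, pairs in groups.items():
        # sorted() is stable, so rows with equal val1 keep their input order.
        ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
        if len(ranked) >= 2:  # assumption: skip ids with fewer than two rows
            result.append((id_, ranked[1][1]))
    return result


print(second_by_val1(rows))  # [(1, 0.3), (2, 0.5)]
```

For id 1 the two rows with val1 = 0.2 tie, so which one counts as "second" depends on how ties are ordered. This sketch keeps input order, which reproduces the 1,0.3 result shown in the question.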

Subtract One row's value from another row in Pig

Question: I'm trying to develop a sample program using Pig to analyse some log files. I want to analyze the running time of different jobs. When I read in the log file of the job, I get the start time and the end time of the job, like this:
(Wed,03/20/13,01:03:37,EDT)
(Wed,03/20/13,01:05:00,EDT)
Now, to calculate the elapsed time, I need to subtract these two timestamps, but since both timestamps are in the same bag, I'm not sure how to compare them. So I'm looking for an idea on how to do this. Thanks!
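
The subtraction itself can be sketched in Python to show the arithmetic on the two tuples above. This is an illustration rather than a Pig solution. The function name `elapsed_seconds` is hypothetical, and the sketch assumes both timestamps share the same time zone, so the weekday and zone fields are ignored.

```python
from datetime import datetime

FORMAT = "%m/%d/%y %H:%M:%S"


def elapsed_seconds(start, end):
    """Return seconds between two (weekday, date, time, zone) tuples."""
    def parse(fields):
        _weekday, date, time, _zone = fields
        # Assumption: start and end use the same zone, so it can be dropped.
        return datetime.strptime(f"{date} {time}", FORMAT)

    return (parse(end) - parse(start)).total_seconds()


start = ("Wed", "03/20/13", "01:03:37", "EDT")
end = ("Wed", "03/20/13", "01:05:00", "EDT")
print(elapsed_seconds(start, end))  # 83.0
```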

Getting error as Failed to create data storage when trying to load the data from HDFS with MovieLens data

Question: I am trying to load data from HDFS into Pig, but I get the error "Failed to create Data Storage". The command that I executed was:
movies = LOAD 'hdfs://localhost:9000/Movie_Lens/ratings' USING PigStorage(':') AS (user_id, dummy1, movie_id, dummy2, movie_rating, dummy3, timestamp);
I searched Stack Overflow for this problem, but the links I found are not related to HDFS and Pig; they are about HDFS and HBase or Pig and HBase. The details of the log file are given below.

Calculating percentage in a pig query

Question: I have a table with two columns (col1:string, col2:boolean). Let's say col1 = "aaa". For col1 = "aaa", there are many True/False values of col2. I want to calculate the percentage of True values for col1 (aaa).
INPUT:
aaa T
aaa F
aaa F
bbb T
bbb T
ccc F
ccc F
OUTPUT:
COL1  TOTAL_ROWS_IN_INPUT_TABLE  PERCENTAGE_TRUE_IN_INPUT_TABLE
aaa   3                          33%
bbb   2                          100%
ccc   2                          0%
How would I do this using Pig (Latin)?
Answer 1: In Pig 0.10 SUM(INPUT.col2) does not work and casting to boolean is not possible as it treats INPUT
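
The aggregation the question describes can be sketched in Python to make the expected numbers explicit. This is not the Pig answer, which is truncated above; the function name `percentage_true` is hypothetical, and rounding to a whole percent is an assumption based on the 33% shown in the expected output.

```python
from collections import Counter

rows = [
    ("aaa", "T"), ("aaa", "F"), ("aaa", "F"),
    ("bbb", "T"), ("bbb", "T"),
    ("ccc", "F"), ("ccc", "F"),
]


def percentage_true(rows):
    """Return (col1, total_rows, percent_true) per col1 value, in input order."""
    totals = Counter()
    trues = Counter()
    for key, flag in rows:
        totals[key] += 1
        if flag == "T":
            trues[key] += 1
    # Assumption: percentages are rounded to the nearest whole number.
    return [(key, totals[key], round(100 * trues[key] / totals[key]))
            for key in totals]


for col1, total, pct in percentage_true(rows):
    print(f"{col1} {total} {pct}%")
```

The same idea, counting the rows per key and separately counting the true rows per key, avoids summing a boolean column, which the answer notes does not work in Pig 0.10.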