apache-pig

Pig 0.7.0 ERROR 2118: Unable to create input splits on Hadoop 1.2.1

一曲冷凌霜 · submitted on 2020-01-06 13:51:09
Question: I have an output file (stored on HDFS) from a MapReduce program. Now I am trying to load that file using Pig 0.7.0, and I am getting the error below. I have tried copying the file to the local machine and running Pig in local mode, which works fine, but I want to skip that step and make it work in mapreduce mode. Options I tried:

LOAD 'file://log/part-00000'
LOAD '/log/part-00000'
LOAD 'hdfs:/log/part-00000'
LOAD 'hdfs://localhost:50070/log/part-00000'

hadoop dfs -ls /log/
Warning: $HADOOP_HOME is
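A hedged note on the usual culprit, not a confirmed fix: port 50070 is the NameNode web UI, while LOAD needs either a plain path (resolved in mapreduce mode against fs.default.name from the cluster configuration) or the NameNode RPC address. A minimal sketch, assuming the file sits at /log/part-00000 and the RPC port is the common default 9000 (both assumptions):

-- let Pig resolve the path through the cluster's fs.default.name
A = LOAD '/log/part-00000' USING PigStorage('\t') AS (key:chararray, value:chararray);

-- or spell out the RPC address; 9000 is an assumed default, not the web UI port 50070
B = LOAD 'hdfs://localhost:9000/log/part-00000' USING PigStorage('\t') AS (key:chararray, value:chararray);

The two-field schema is illustrative only; the real schema depends on what the MapReduce job wrote.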

Pig - use ternary condition to filter based on different condition

ε祈祈猫儿з · submitted on 2020-01-06 12:56:27
Question: I'm trying to use Pig to filter a relation based on the current day of the week, using a ternary condition, but it gives me an error I haven't seen before. This is what I'm trying to do:

C = filter B by (DaysBetween(CurrentTime(),ToDate(0L)) % 7) == (long)0 ? B.interval == 'daily' : B.interval == 'weekly';

and the error that returns is:

ERROR 1200: Pig script failed to parse: NoViableAltException(84@[])
Failed to parse: Pig script failed to parse: NoViableAltException(84@[])
at org.apache
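One commonly suggested workaround, sketched here under the assumption that the parse error comes from using the bincond (?:) operator where FILTER expects a boolean expression: rewrite the ternary as plain AND/OR logic, and reference the field as interval rather than B.interval inside FILTER:

-- a sketch: express "if day-condition then daily else weekly" as boolean logic
C = FILTER B BY
    (((DaysBetween(CurrentTime(), ToDate(0L)) % 7) == 0) AND interval == 'daily')
    OR
    (((DaysBetween(CurrentTime(), ToDate(0L)) % 7) != 0) AND interval == 'weekly');

This evaluates the same condition twice, once negated, which is the standard way to emulate a ternary inside a FILTER predicate.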

How to solve the "file does not exist" error when setting up an Oozie coordinator

孤人 · submitted on 2020-01-06 07:14:32
Question: How do I resolve the error when the file doesn't exist while setting up an Oozie coordinator? I get this error in the coordinator log:

Pig logfile dump:
Backend error message
Error: java.io.FileNotFoundException: File does not exist: /user/hdfs/jay/part-0.tmp

Coordinator configuration:

<coordinator-app name="tes-ng" frequency="${coord:minutes(15)}" start="2015-12-07T10:30+0700" end="2017-02-28T23:00+0700" timezone="Asia/Jakarta" xmlns="uri:oozie:coordinator:0.1" xmlns:sla="uri:oozie:sla:0.1">
  <controls>
    <execution>LAST_ONLY
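A hedged pointer rather than a verified fix: the usual way to keep a coordinator from reading a file that is still being written (hence the part-0.tmp FileNotFoundException) is to declare the input as a dataset with a done-flag, so the action only materializes once the producing job has finished. A minimal sketch in coordinator XML, with every path, name, and frequency a placeholder:

<datasets>
  <dataset name="input" frequency="${coord:minutes(15)}" initial-instance="2015-12-07T10:30+0700" timezone="Asia/Jakarta">
    <uri-template>hdfs:///user/hdfs/jay/${YEAR}${MONTH}${DAY}${HOUR}${MINUTE}</uri-template>
    <done-flag>_SUCCESS</done-flag>
  </dataset>
</datasets>
<input-events>
  <data-in name="raw" dataset="input">
    <instance>${coord:current(0)}</instance>
  </data-in>
</input-events>

With a done-flag in place, the coordinator action waits for _SUCCESS instead of firing on the 15-minute clock alone.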

CASE statement in PIG

孤街醉人 · submitted on 2020-01-06 04:46:06
Question: I am trying to derive 'vertex_code' from 'geocode' based on a few conditions on SUBSTRING(geocode,0,2):

SUBSTRING(geocode,0,2) ----> Code
00-51 ----> 01
70    ----> 03
61-78 ----> 04
Else  ----> 00

The obtained 'code' value then has to be prefixed to the 'geocode' value and concatenated with 00 at the end (as a suffix) to form the 'vertex_code'. For example, for geocode = 44556677: SUBSTRING(geocode,0,2) is between 00-51, so code = 01, hence vertex_code = 014455667700. Below is my script:

item = load '/user/item.txt' USING
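Since the script above is cut off, here is only a sketch of one way to express the mapping, assuming Pig 0.12+ (which added CASE), that geocode is a chararray of digits, and hypothetical field names. Because 70 falls inside 61-78, the arm for 70 must be tested first:

item  = LOAD '/user/item.txt' USING PigStorage(',') AS (geocode:chararray);
coded = FOREACH item GENERATE geocode,
    (CASE
        WHEN (int)SUBSTRING(geocode,0,2) <= 51 THEN '01'  -- digits, so this covers 00-51
        WHEN (int)SUBSTRING(geocode,0,2) == 70 THEN '03'  -- must come before the 61-78 arm
        WHEN (int)SUBSTRING(geocode,0,2) >= 61 AND (int)SUBSTRING(geocode,0,2) <= 78 THEN '04'
        ELSE '00'
    END) AS code;
-- prefix the code and append the '00' suffix: 44556677 -> 014455667700
result = FOREACH coded GENERATE CONCAT(CONCAT(code, geocode), '00') AS vertex_code;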

Pig - Remove embedded newlines and commas in gzip files

℡╲_俬逩灬. · submitted on 2020-01-06 04:19:27
Question: I have a gzip file with data fields separated by commas. I am currently using PigStorage to load the file, as shown below:

A = load 'myfile.gz' USING PigStorage(',') AS (id,date,text);

The data in the gzip file has embedded characters: embedded newlines and commas. These characters exist in all three fields (id, date, and text) and are always within "" quotes. I would like to replace or remove these characters using Pig before doing any further processing. I think I
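PigStorage splits on every delimiter and does not honor quotes, so a commonly suggested route (a sketch, assuming piggybank is available and the quoting in the file is well-formed) is CSVExcelStorage, which understands quoted fields and multiline records; Pig decompresses .gz input transparently:

REGISTER /path/to/piggybank.jar;  -- path is a placeholder
A = LOAD 'myfile.gz'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE')
    AS (id:chararray, date:chararray, text:chararray);
-- once fields parse correctly, embedded newlines can be stripped explicitly
B = FOREACH A GENERATE id, date, REPLACE(text, '\\n', ' ') AS text;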

FILTER ON column from another relation in PIG

限于喜欢 · submitted on 2020-01-06 01:45:55
Question: Suppose I have the following data in Pig:

DUMP raw;
(2015-09-15T22:11:00.000-07:00,1)
(2015-09-15T22:12:00.000-07:00,2)
(2015-09-15T23:11:00.000-07:00,3)
(2015-09-16T21:02:00.000-07:00,4)
(2015-09-15T00:02:00.000-07:00,5)
(2015-09-17T08:02:00.000-07:00,5)
(2015-09-17T09:02:00.000-07:00,5)
(2015-09-17T09:02:00.000-07:00,1)
(2015-09-17T19:02:00.000-07:00,1)

DESCRIBE raw;
raw: {process_date: chararray,id: int}

A = GROUP raw BY id;
DESCRIBE A;
A: {group: int,raw: {(process_date: chararray,id:
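Since the post is truncated, only the general pattern can be sketched: Pig does not allow referencing a second relation directly inside FILTER, so the usual choices are a JOIN (when the other relation has many rows) or a scalar projection (when it has exactly one row). The relation keep_ids below is hypothetical:

-- keep only rows of raw whose id appears in a (hypothetical) relation keep_ids
joined = JOIN raw BY id, keep_ids BY id;
kept   = FOREACH joined GENERATE raw::process_date AS process_date, raw::id AS id;

-- or, when the other relation is a single row, reference it as a scalar
maxid = FOREACH (GROUP raw ALL) GENERATE MAX(raw.id) AS m;
top   = FILTER raw BY id == maxid.m;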

Pig: Hadoop jobs Fail

…衆ロ難τιáo~ · submitted on 2020-01-05 09:34:34
Question: I have a Pig script that queries data from a CSV file. The script has been tested locally with small and large .csv files. On a small cluster, it starts processing the script and fails after completing 40% of the call. The error is just:

Failed to read data from "path to file"

What I infer is that the script could read the file, but some connection dropped or a message was lost; I get only the error mentioned above.

Answer 1: An answer for the general problem would be changing the errors
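A hedged note, not a confirmed fix for this script: when a job dies partway through on the cluster but works locally, the per-task logs on the failing nodes usually carry the real exception. If the cause turns out to be a few bad splits, MRv1 exposes a tolerance knob that can be set from Pig; the path below is a placeholder:

-- MRv1 property: let the job succeed even if up to 10% of map tasks fail
SET mapred.max.map.failures.percent 10;
A = LOAD '/path/to/data.csv' USING PigStorage(',');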

apache pig - url parsing into a map

一世执手 · submitted on 2020-01-05 07:52:33
Question: I am pretty new to Pig and have a question about log parsing. I currently parse out the important tags in my URL string via REGEX_EXTRACT, but I am thinking I should transform the whole string into a map. I am working on a sample set of data using Pig 0.10, but am starting to get really lost. In reality, the tags in my URL string are repeated, so my map should actually be a map with bags as the values; then I could just write any subsequent job using FLATTEN. Here is my test data; the last entry shows my
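Since the test data is cut off, here is only a rough sketch of the pure-Pig approach for repeated tags (a true map with bags as values generally needs a UDF). TOKENIZE with a custom delimiter requires Pig 0.11+, and all field names below are assumptions:

-- assume one query string per row, e.g. 'a=1&b=2&a=3'
raw   = LOAD 'urls.txt' AS (qs:chararray);
pairs = FOREACH raw GENERATE FLATTEN(TOKENIZE(qs, '&')) AS kv;
kvs   = FOREACH pairs GENERATE FLATTEN(STRSPLIT(kv, '=', 2)) AS (key:chararray, value:chararray);
bykey = GROUP kvs BY key;  -- repeated tags collapse into one bag per key
out   = FOREACH bykey GENERATE group AS key, kvs.value AS values;

Note this groups across all rows; to keep one map per URL, carry a row id through and group by (id, key) instead.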