apache-pig

apache pig - url parsing into a map

只谈情不闲聊 submitted on 2020-01-05 07:51:07
Question: I am pretty new to Pig and have a question about log parsing. I currently parse out important tags in my URL string via REGEX_EXTRACT, but am thinking I should transform the whole string into a map. I am working on a sample set of data using 0.10, but am starting to get really lost. In reality, my URL string has tags repeated, so my map should actually be a map with bags as the values; then I could just write any subsequent job using FLATTEN. Here is my test data. The last entry shows my
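The question is cut off above, but the REGEX_EXTRACT approach it mentions looks roughly like the following. This is only a minimal sketch: the file name logs.txt, the tab delimiter, and the tag= parameter name are all assumptions, and turning the whole query string into a map of bags would still need a UDF rather than a built-in.

logs = LOAD 'logs.txt' USING PigStorage('\t') AS (url:chararray);
-- pull one query-string parameter out with REGEX_EXTRACT; 1 refers to the first capture group
tags = FOREACH logs GENERATE REGEX_EXTRACT(url, 'tag=([^&]+)', 1) AS tag;
DUMP tags;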

How to load multi-line column data in hive table? Columns having new line characters

一世执手 submitted on 2020-01-05 07:44:29
Question: I have a column (not the last column) in an Excel file that contains data spanning a few lines. Some cells of the column are blank and some have single-line entries. When saving as a .CSV file or a tab-separated .txt from Excel, all the multi-line data and a few single-line entries end up wrapped in double quotes; none of the blank fields are in quotes, and some of the single-line entries are not within quotes. Is it possible to store the data with this same structure in a Hive table? If
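The excerpt ends before any answer, but one Hadoop-side workaround (sketched here in Pig, since that is what the rest of this page uses) is to pre-clean the file with piggybank's CSVExcelStorage, which can keep quoted multi-line fields together, and then re-store it in a plain tab-separated form that Hive's newline-delimited text SerDe can read. The file names and schema below are assumptions.

REGISTER piggybank.jar;
-- YES_MULTILINE tells the loader to keep quoted fields that span several lines in one record
raw = LOAD 'excel_export.csv'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE')
      AS (id:chararray, notes:chararray, status:chararray);
-- strip the embedded newlines so the result is safe for a newline-delimited Hive table
clean = FOREACH raw GENERATE id, REPLACE(notes, '\\n', ' ') AS notes, status;
STORE clean INTO 'clean_for_hive' USING PigStorage('\t');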

bincond evaluation in pig

南笙酒味 submitted on 2020-01-05 04:32:05
Question: I am trying to replace the missing values with some precomputed value, so I posted the question here and followed the advice. Here is the code snippet:
input = LOAD 'data.txt' USING PigStorage(',') AS (id1:double, id2:double);
gin = foreach input generate id1 IS NULL ? 2 : id1, id2 IS NULL ? 4 : id2;
But I am getting an error: mismatched input 'IS' expecting SEMI_COLON. Answer 1: Try adding parentheses in the bincond. The following works properly for me. Contents of input: 0.9,1.11 ,0.3 10.3
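Per the answer, the fix is to parenthesise each bincond. A minimal sketch of the corrected script follows; note the alias is renamed from input to raw here, since input is a reserved word in some Pig versions, and the constants are written as doubles to match the declared schema.

raw = LOAD 'data.txt' USING PigStorage(',') AS (id1:double, id2:double);
-- wrapping each bincond in parentheses avoids the "mismatched input 'IS'" parse error
gin = FOREACH raw GENERATE (id1 IS NULL ? 2.0 : id1) AS id1, (id2 IS NULL ? 4.0 : id2) AS id2;
DUMP gin;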

Can I use "filter by' with Map structure in hadoop - PIG?

*爱你&永不变心* 提交于 2020-01-05 04:18:14
Question: Provided that there's a map like
map.text
[key1#v1]
[key2#v2]
[key3#v3]
then, if I try to find the value of 'key2':
A = load 'map.text' as (M:map[]);
B = foreach A generate M#'key2';
C = filter B by $0 != ''; -- to get rid of empty values like (), (), ()
dump C;
Is there any other way to find key2, using 'filter by' only? Thanks. Answer 1: There is no need to GENERATE a field and then use it in a FILTER; you can include it in the FILTER statement to begin with: A = load 'map.text' as (M:map[]
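Following the answer's suggestion, a minimal sketch that filters on the map lookup directly (no intermediate GENERATE) could look like this; whether you test for the empty string or for null depends on how the missing keys actually come through.

A = LOAD 'map.text' AS (M:map[]);
-- filter straight on the map lookup, then project the value
B = FILTER A BY M#'key2' IS NOT NULL;
C = FOREACH B GENERATE M#'key2' AS key2_value;
DUMP C;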

Encoding in Pig

谁都会走 submitted on 2020-01-04 15:17:06
Question: When loading data that contains some particular characters (for example À, °, and others) using Pig Latin and storing the data in a .txt file, those symbols show up in the txt file as � and ï characters. That happens because of the UTF-8 substitution character. I would like to ask whether it is possible to avoid this somehow, maybe with some Pig commands, so that the result (in the txt file) contains, for example, À instead of �? Answer 1: In Pig we have built-in dynamic invokers that allow a
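The answer is truncated, but Pig's dynamic invokers (InvokeForString and friends) are the mechanism it refers to: they let you call a static Java method from a script without writing a UDF. The sketch below only illustrates the invoker syntax, using java.net.URLDecoder.decode as the target method; it is not presented as the fix for this particular À/� substitution problem, which usually comes down to the input not being UTF-8 to begin with.

-- bind a static Java method as if it were a Pig function
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
A = LOAD 'data.txt' AS (encoded:chararray);
B = FOREACH A GENERATE UrlDecode(encoded, 'UTF-8') AS decoded;
DUMP B;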

Pig casting / datatypes

ぃ、小莉子 submitted on 2020-01-04 08:15:22
Question: I'm trying to dump a relation into an AVRO file but I'm getting a strange error: org.apache.pig.data.DataByteArray cannot be cast to java.lang.CharSequence. I don't use DataByteArray (bytearray); see the description of the relation below. sensitiveSet: {rank_ID: long,name: chararray,customerId: long,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray} Even when I do explicit casting I get the same error: sensitiveSet = foreach sensitiveSet generate (long) $0,
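A commonly suggested first step for this error is to cast every field explicitly and re-declare the schema with AS, so that the storer sees the intended types rather than bytearray at runtime. A minimal sketch is below; the output path and the piggybank AvroStorage class are assumptions about the setup, not part of the original question.

typedSet = FOREACH sensitiveSet GENERATE
    (long)      $0 AS rank_ID,
    (chararray) $1 AS name,
    (long)      $2 AS customerId,
    (chararray) $3 AS VIN,
    (chararray) $4 AS birth_date,
    (chararray) $5 AS fuel_mileage,
    (chararray) $6 AS fuel_consumption;
STORE typedSet INTO 'sensitive_avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();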

Apache Pig process CSV with fields wrapped in quotes

廉价感情. submitted on 2020-01-04 06:24:53
Question: How can I process a CSV file where some fields are wrapped in quotes? Line to process, for example (field delimiter is ','): I am column1, I am column2, "yes, I'm am column3" The example has three columns, but the following will say that I have four columns: A = load '/path/to/file' using PigStorage(','); Any suggestions, or links to resources? Answer 1: Try loading the data, then do a FOREACH GENERATE to regenerate the data into whatever format you need. For the fields where you need to
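The quoted answer goes the FOREACH GENERATE route, but another frequently used option is piggybank's CSVExcelStorage, which understands quoted fields. A minimal sketch, under the assumption that piggybank.jar is available on the cluster:

REGISTER piggybank.jar;
-- the loader honours the double quotes, so "yes, I'm am column3" stays in one field
A = LOAD '/path/to/file'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
    AS (col1:chararray, col2:chararray, col3:chararray);
DUMP A;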

schema of flatten operator in pig latin

北城以北 submitted on 2020-01-04 04:17:06
Question: I recently ran into this problem at work; it's about Pig's FLATTEN. I'll use a simple example to illustrate it. Two files:
===file1===
1_a
2_b
4_d
===file2 (tab separated)===
1 a
2 b
3 c
pig script 1:
a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
c = join a1 by num, b by num;
dump c;
-- exception: java.lang.String cannot be cast to java.lang.Integer
pig script 2:
a = load 'file1' as (str
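The exception in script 1 happens because STRSPLIT returns chararrays and the AS clause only relabels the schema without converting the data, so the join on an int field blows up at runtime. A minimal sketch of one workaround is to cast explicitly in a second FOREACH:

a  = LOAD 'file1' AS (str:chararray);
b  = LOAD 'file2' AS (num:int, ch:chararray);
-- keep the split results as chararray, then cast for real in a separate step
a1 = FOREACH a GENERATE FLATTEN(STRSPLIT(str, '_', 2)) AS (num_str:chararray, ch:chararray);
a2 = FOREACH a1 GENERATE (int)num_str AS num, ch;
c  = JOIN a2 BY num, b BY num;
DUMP c;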

Aggregate row value into columns

…衆ロ難τιáo~ submitted on 2020-01-04 03:02:19
Question: I have data like this:
2013-11 localhost kern
2013-11 localhost kern
2013-11 192.168.0.59 daemon
2013-12 localhost kern
2013-12 localhost daemon
2013-12 localhost mail
You get the idea. I'm trying to group the above by date (as the row key) and have a column corresponding to the count of each of kern, daemon, etc. In short, my desired output should be as below:
-- date, count(kern), count(daemon), count(mail)
(2013-11, 2, 1, 0)
(2013-12, 1, 1, 1)
Currently, my approach is like this. valid
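The approach is cut off above, but one way to get the pivoted counts is a GROUP BY date with nested FILTERs inside the FOREACH; the file name and the space delimiter below are assumptions about how the sample data is stored.

logs    = LOAD 'syslog_summary.txt' USING PigStorage(' ') AS (dt:chararray, host:chararray, facility:chararray);
grouped = GROUP logs BY dt;
pivoted = FOREACH grouped {
    kern_rows   = FILTER logs BY facility == 'kern';
    daemon_rows = FILTER logs BY facility == 'daemon';
    mail_rows   = FILTER logs BY facility == 'mail';
    -- COUNT of an empty bag is 0, which gives the zero cells in the desired output
    GENERATE group AS dt, COUNT(kern_rows) AS kern,
             COUNT(daemon_rows) AS daemon, COUNT(mail_rows) AS mail;
};
DUMP pivoted;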

pig skewed join with a big table causes “Split metadata size exceeded 10000000”

℡╲_俬逩灬. submitted on 2020-01-03 13:09:09
Question: We have a Pig join between a small (16M rows) distinct table and a big (6B rows) skewed table. A regular join finishes in 2 hours (after some tweaking). We tried USING 'skewed' and were able to improve the performance to 20 minutes. HOWEVER, when we try a bigger skewed table (19B rows), we get this message from the SAMPLER job: Split metadata size exceeded 10000000. Aborting job job_201305151351_21573 [ScriptRunner] at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo
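The 10000000 in the message is the default split metainfo size limit. A commonly suggested knob is to raise or disable it for the script's jobs, as sketched below; note that the property name is taken from the Hadoop docs, the table paths and schemas are placeholders, and on some Hadoop 1 clusters this setting only takes effect when changed cluster-side in mapred-site.xml rather than from the script.

-- '-1' disables the split metainfo size limit for the jobs this script launches
SET mapreduce.jobtracker.split.metainfo.maxsize '-1';
small = LOAD 'small_table' USING PigStorage('\t') AS (k:chararray, v1:chararray);
big   = LOAD 'big_table'   USING PigStorage('\t') AS (k:chararray, v2:chararray);
j     = JOIN big BY k, small BY k USING 'skewed';
STORE j INTO 'joined_out';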