apache-pig

How do I transpose columns and rows in Pig

Question: I'm not sure whether this can be done with built-in Pig scripts or whether I'll need to code a UDF, but I essentially have a table whose data I simply want to transpose. Simply put, given:

(1, 2, 3, 4, 5)
(6, 7, 8, 9, 10)
(11, 12, 13, 14, 15)
... 300-plus more tuples

I would end up with:

(1, 6, 11, ...) -> goes on for a few hundred more
(2, 7, 12, ...)
(3, 8, 13, ...)
(4, 9, 14, ...)
(5, 10, 15, ...)

Any suggestions on how I could accomplish this?

Answer 1: This is not possible with Pig, nor does it make much sense for it
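
The answer rules out a direct transpose, but for illustration, a workaround sometimes used for fixed-arity rows is to number each row with RANK, unpivot it into (row, col, value) triples, and regroup by column index. A minimal sketch; everything below (the file name rows.tsv, the five-integer-column schema) is an assumption, not from this entry:

rows   = LOAD 'rows.tsv' USING PigStorage('\t')
             AS (c1:int, c2:int, c3:int, c4:int, c5:int);
ranked = RANK rows;  -- prepends a row number as $0
-- unpivot: one (row, col, value) triple per cell
cells  = FOREACH ranked GENERATE $0 AS row, FLATTEN(TOBAG(
             TOTUPLE(1, c1), TOTUPLE(2, c2), TOTUPLE(3, c3),
             TOTUPLE(4, c4), TOTUPLE(5, c5))) AS (col:int, value:int);
bycol  = GROUP cells BY col;
-- each group becomes one row of the transposed table, in original row order
transposed = FOREACH bycol {
                 ordered = ORDER cells BY row;
                 GENERATE group AS col, ordered.value AS values;
             };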

How to transpose a few corresponding columns in Pig/Hive

Question: I was wondering whether it is possible to transpose a few corresponding columns in Pig/Hive. While dealing with my data I got the requirement below:

id  jan  feb  march
1   j1   f1   m1
2   j2   f2   m2
3   j3   f3   m3

I need to transpose it against the first column, so it would look like:

id  value  month
1   j1     jan
1   f1     feb
1   m1     march
2   j2     jan
2   f2     feb
2   m2     march
3   j3     jan
3   f3     feb
3   m3     march

I have tried this with Java, but to get it into distributed mode: is there any way to do it in Pig/Hive? Appreciating your help in advance.
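
No answer body survives in this entry, but the standard Pig idiom for this unpivot is TOBAG/TOTUPLE plus FLATTEN; the input path and delimiter below are assumptions:

A = LOAD 'months.tsv' USING PigStorage('\t')
        AS (id:int, jan:chararray, feb:chararray, march:chararray);
-- pair each value with its month name, then flatten into one row per pair
B = FOREACH A GENERATE id, FLATTEN(TOBAG(
        TOTUPLE(jan,   'jan'),
        TOTUPLE(feb,   'feb'),
        TOTUPLE(march, 'march'))) AS (value:chararray, month:chararray);
DUMP B;  -- (1,j1,jan), (1,f1,feb), (1,m1,march), (2,j2,jan), ...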

Pig: loading a data file using an external schema file

Question: I have a data file and a corresponding schema file stored in separate locations. I would like to load the data using the schema in the schema file. I tried using

A = LOAD '<file path>' USING PigStorage('\u0001') AS '<schema-file path>';

but get an error. What is the syntax for correctly loading the file? The schema file format is something like:

data1 - complex - - - - format - -
data1 event_type - - - - - long - "ends '\001'"
data1 event_id - - - - - varchar(50) - "ends '\001'"
data1 name
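
No answer body survives here, but the AS clause only accepts an inline schema, never a file path, which is why the statement above fails. Two hedged alternatives follow; the field names and types are guesses mapped from the schema-file excerpt, and '-schema' is PigStorage's option for reading a .pig_schema JSON file stored next to the data (the custom format above would first have to be converted into that JSON form):

-- option 1: spell the schema out inline
A = LOAD '<file path>' USING PigStorage('\u0001')
        AS (event_type:long, event_id:chararray, name:chararray);

-- option 2: place a .pig_schema JSON file in the data directory and let
-- PigStorage pick it up
B = LOAD '<file path>' USING PigStorage('\u0001', '-schema');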

Storing data to SequenceFile from Apache Pig

Question: Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader:

REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
log = LOAD '/data/logs' USING SequenceFileLoader AS (...);

Is there also a library out there that would allow writing to Hadoop sequence files from Pig?

Answer 1: It's just a matter of implementing a StoreFunc to do so. This is possible now, although it will become
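
The truncated answer points at writing a StoreFunc. For illustration (not from this entry): Twitter's Elephant Bird project ships such a StoreFunc; the jar path and converter classes below are assumptions based on its documentation, not on the answer above.

REGISTER /path/to/elephant-bird.jar;  -- hypothetical jar location
STORE log INTO '/data/logs-out' USING
    com.twitter.elephantbird.pig.store.SequenceFileStorage(
        '-c com.twitter.elephantbird.pig.util.TextConverter',   -- key converter
        '-c com.twitter.elephantbird.pig.util.TextConverter');  -- value converter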

Pig script REPLACE with pipe symbol

Question: I want to strip the characters outside of the curly brackets in rows that look like the following:

35|{......}|

stripping the '35|' from the front and the trailing '|' from the end, leaving:

{.....}

Initially working on just the first 3 characters, I tried the following, but it removes everything:

a = LOAD '/file' AS (line1:chararray);
b = FOREACH a GENERATE REPLACE(line1, '35|', '');
dump b;

Any thoughts appreciated. Thanks.

Answer 1: |, { and } are special characters in regular expressions, and the second parameter of REPLACE is treated as a regular expression, so they must be escaped.
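
Building on the answer: escape the metacharacters, or skip REPLACE entirely and pull out the bracketed payload with REGEX_EXTRACT. A short sketch using the question's input path:

a = LOAD '/file' AS (line1:chararray);
-- escape the pipe so it matches a literal '|', and anchor the prefix
b = FOREACH a GENERATE REPLACE(line1, '^35\\|', '');
-- or extract the {...} payload in one pass
c = FOREACH a GENERATE REGEX_EXTRACT(line1, '(\\{.*\\})', 1);
DUMP c;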

Pig equivalent of SQL GREATEST / LEAST?

Question: I'm trying to find the Pig equivalent of the SQL functions GREATEST and LEAST. These functions are the scalar equivalents of the aggregate SQL functions MAX and MIN, respectively. Essentially, I want to be able to say something like this:

x = LOAD 'file:///a/b/c.csv' USING PigStorage() AS (a:int, b:int, c:int);
y = FOREACH x GENERATE a AS a:int, b AS b:int, c AS c:int, GREATEST(a, b, c) AS g:int;

I know I could use bags and MAX to get this done, but I'm translating from another
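
For reference, the bag-based workaround the asker mentions looks like this (a sketch: TOBAG wraps each field into a single-field tuple, and the aggregates MAX/MIN then run over that bag):

x = LOAD 'file:///a/b/c.csv' USING PigStorage() AS (a:int, b:int, c:int);
y = FOREACH x GENERATE a, b, c,
        MAX(TOBAG(a, b, c)) AS greatest,
        MIN(TOBAG(a, b, c)) AS least;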

Pig & Cassandra & DataStax Splits Control

Question: I have been using Pig with my Cassandra data to do all kinds of amazing feats of grouping that would be almost impossible to write imperatively. I am using DataStax's integration of Hadoop and Cassandra, and I have to say it is quite impressive. Hats off to those guys!

I have a pretty small sandbox cluster (2 nodes) where I am putting this system through some tests. I have a CQL table that has ~53M rows (about 350 bytes each), and I notice that the mapper later takes a very long time to grind through
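
No answer body survives here, but the usual knob for this is the Cassandra input split size, settable straight from the Pig script. A hedged sketch; the property name, keyspace/table, and the CqlStorage class are assumptions based on the Cassandra Hadoop integration, not on this entry:

-- rows per input split for the Cassandra input format; a smaller value
-- yields more, smaller map tasks over the ~53M rows
SET cassandra.input.split.size 16384;
rows = LOAD 'cql://mykeyspace/mytable'
       USING org.apache.cassandra.hadoop.pig.CqlStorage();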

Usage of Apache Pig rank function

Question: I am using the Pig 0.11.0 RANK function and generating ranks for every id in my data, but I need the ranking done in a particular way: I want the rank to reset and start from 1 for every new id. Is it possible to use the rank function directly for this? Any tips would be appreciated.

Data:

id, rating
X001, 9
X001, 9
X001, 8
X002, 9
X002, 7
X002, 6
X002, 5
X003, 8
X004, 8
X004, 7
X004, 7
X004, 4

On using the rank function like:

op = rank data by id, score;

I get this output:

rank, id, rating
1, X001, 9
1,
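
Pig's RANK cannot reset per group, so a common workaround (not from this entry; the DataFu jar path is an assumption) is to group by id, order each bag, and number its tuples with DataFu's Enumerate UDF, which appends the index as a new last field:

REGISTER /path/to/datafu.jar;  -- hypothetical location of the DataFu jar
DEFINE Enumerate datafu.pig.bags.Enumerate('1');  -- start numbering at 1

grouped = GROUP data BY id;
ranked  = FOREACH grouped {
              sorted = ORDER data BY rating DESC;
              GENERATE FLATTEN(Enumerate(sorted)) AS (id, rating, rank);
          };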
