apache-pig

How do I transpose columns and rows in Pig

Question: I'm not sure whether this can be done with built-in Pig scripts or whether I'll need to code a UDF, but I essentially have a table whose data I simply want to transpose. Simply put, given:

(1, 2, 3, 4, 5)
(6, 7, 8, 9, 10)
(11, 12, 13, 14, 15)
... 300-plus more tuples

I would end up with:

(1, 6, 11, ...) -> goes on for a few hundred more
(2, 7, 12, ...)
(3, 8, 13, ...)
(4, 9, 14, ...)
(5, 10, 15, ...)

Any suggestions on how I could accomplish this?

Answer 1: This is not possible with Pig, nor does it make much sense for it
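
The answer rules out a direct transpose, but for illustration, a workaround sometimes used for fixed-arity rows is to number each row with RANK, unpivot it into (row, col, value) triples, and regroup by column index. A minimal sketch; everything below (the file name rows.tsv, the five-integer-column schema) is an assumption, not from this entry:

rows   = LOAD 'rows.tsv' USING PigStorage('\t')
             AS (c1:int, c2:int, c3:int, c4:int, c5:int);
ranked = RANK rows;  -- prepends a row number as $0
-- unpivot: one (row, col, value) triple per cell
cells  = FOREACH ranked GENERATE $0 AS row, FLATTEN(TOBAG(
             TOTUPLE(1, c1), TOTUPLE(2, c2), TOTUPLE(3, c3),
             TOTUPLE(4, c4), TOTUPLE(5, c5))) AS (col:int, value:int);
bycol  = GROUP cells BY col;
-- each group becomes one row of the transposed table, in original row order
transposed = FOREACH bycol {
                 ordered = ORDER cells BY row;
                 GENERATE group AS col, ordered.value AS values;
             };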

How to transpose a few corresponding columns in Pig/Hive

Question: I was wondering whether it is possible to transpose a few corresponding columns in Pig/Hive. While dealing with my data I got the requirement below:

id  jan  feb  march
1   j1   f1   m1
2   j2   f2   m2
3   j3   f3   m3

I need to transpose it against the first column, so it would look like:

id  value  month
1   j1     jan
1   f1     feb
1   m1     march
2   j2     jan
2   f2     feb
2   m2     march
3   j3     jan
3   f3     feb
3   m3     march

I have tried this with Java, but to get it into distributed mode: is there any way to do it in Pig/Hive? Appreciating your help in advance.
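
No answer body survives in this entry, but the standard Pig idiom for this unpivot is TOBAG/TOTUPLE plus FLATTEN; the input path and delimiter below are assumptions:

A = LOAD 'months.tsv' USING PigStorage('\t')
        AS (id:int, jan:chararray, feb:chararray, march:chararray);
-- pair each value with its month name, then flatten into one row per pair
B = FOREACH A GENERATE id, FLATTEN(TOBAG(
        TOTUPLE(jan,   'jan'),
        TOTUPLE(feb,   'feb'),
        TOTUPLE(march, 'march'))) AS (value:chararray, month:chararray);
DUMP B;  -- (1,j1,jan), (1,f1,feb), (1,m1,march), (2,j2,jan), ...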

Pig: loading a data file using an external schema file

Question: I have a data file and a corresponding schema file stored in separate locations. I would like to load the data using the schema in the schema file. I tried using

A = LOAD '<file path>' USING PigStorage('\u0001') AS '<schema-file path>';

but get an error. What is the syntax for correctly loading the file? The schema file format is something like:

data1 - complex - - - - format - -
data1 event_type - - - - - long - "ends '\001'"
data1 event_id - - - - - varchar(50) - "ends '\001'"
data1 name
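
No answer body survives here, but the AS clause only accepts an inline schema, never a file path, which is why the statement above fails. Two hedged alternatives follow; the field names and types are guesses mapped from the schema-file excerpt, and '-schema' is PigStorage's option for reading a .pig_schema JSON file stored next to the data (the custom format above would first have to be converted into that JSON form):

-- option 1: spell the schema out inline
A = LOAD '<file path>' USING PigStorage('\u0001')
        AS (event_type:long, event_id:chararray, name:chararray);

-- option 2: place a .pig_schema JSON file in the data directory and let
-- PigStorage pick it up
B = LOAD '<file path>' USING PigStorage('\u0001', '-schema');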

Storing data to SequenceFile from Apache Pig

Question: Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader:

REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
log = LOAD '/data/logs' USING SequenceFileLoader AS (...);

Is there also a library out there that would allow writing to Hadoop sequence files from Pig?

Answer 1: It's just a matter of implementing a StoreFunc to do so. This is possible now, although it will become
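
The truncated answer points at writing a StoreFunc. For illustration (not from this entry): Twitter's Elephant Bird project ships such a StoreFunc; the jar path and converter classes below are assumptions based on its documentation, not on the answer above.

REGISTER /path/to/elephant-bird.jar;  -- hypothetical jar location
STORE log INTO '/data/logs-out' USING
    com.twitter.elephantbird.pig.store.SequenceFileStorage(
        '-c com.twitter.elephantbird.pig.util.TextConverter',   -- key converter
        '-c com.twitter.elephantbird.pig.util.TextConverter');  -- value converter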

Pig script REPLACE with pipe symbol

Question: I want to strip the characters outside of the curly brackets in rows that look like the following:

35|{......}|

stripping the '35|' from the front and the trailing '|' from the end, leaving:

{.....}

Initially working on just the first 3 characters, I tried the following, but it removes everything:

a = LOAD '/file' AS (line1:chararray);
b = FOREACH a GENERATE REPLACE(line1, '35|', '');
dump b;

Any thoughts appreciated. Thanks.

Answer 1: |, { and } are special characters in regular expressions, and the second parameter of REPLACE is treated as a regular expression, so they must be escaped.
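
Building on the answer: escape the metacharacters, or skip REPLACE entirely and pull out the bracketed payload with REGEX_EXTRACT. A short sketch using the question's input path:

a = LOAD '/file' AS (line1:chararray);
-- escape the pipe so it matches a literal '|', and anchor the prefix
b = FOREACH a GENERATE REPLACE(line1, '^35\\|', '');
-- or extract the {...} payload in one pass
c = FOREACH a GENERATE REGEX_EXTRACT(line1, '(\\{.*\\})', 1);
DUMP c;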

Pig equivalent of SQL GREATEST / LEAST?

Question: I'm trying to find the Pig equivalent of the SQL functions GREATEST and LEAST. These functions are the scalar equivalents of the aggregate SQL functions MAX and MIN, respectively. Essentially, I want to be able to say something like this:

x = LOAD 'file:///a/b/c.csv' USING PigStorage() AS (a:int, b:int, c:int);
y = FOREACH x GENERATE a AS a:int, b AS b:int, c AS c:int, GREATEST(a, b, c) AS g:int;

I know I could use bags and MAX to get this done, but I'm translating from another
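
For reference, the bag-based workaround the asker mentions looks like this (a sketch: TOBAG wraps each field into a single-field tuple, and the aggregates MAX/MIN then run over that bag):

x = LOAD 'file:///a/b/c.csv' USING PigStorage() AS (a:int, b:int, c:int);
y = FOREACH x GENERATE a, b, c,
        MAX(TOBAG(a, b, c)) AS greatest,
        MIN(TOBAG(a, b, c)) AS least;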

Pig & Cassandra & DataStax Splits Control

Question: I have been using Pig with my Cassandra data to do all kinds of amazing feats of grouping that would be almost impossible to write imperatively. I am using DataStax's integration of Hadoop and Cassandra, and I have to say it is quite impressive. Hats off to those guys!

I have a pretty small sandbox cluster (2 nodes) where I am putting this system through some tests. I have a CQL table that has ~53M rows (about 350 bytes each), and I notice that the mapper later takes a very long time to grind through
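
No answer body survives here, but the usual knob for this is the Cassandra input split size, settable straight from the Pig script. A hedged sketch; the property name, keyspace/table, and the CqlStorage class are assumptions based on the Cassandra Hadoop integration, not on this entry:

-- rows per input split for the Cassandra input format; a smaller value
-- yields more, smaller map tasks over the ~53M rows
SET cassandra.input.split.size 16384;
rows = LOAD 'cql://mykeyspace/mytable'
       USING org.apache.cassandra.hadoop.pig.CqlStorage();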

Usage of Apache Pig rank function

Question: I am using the Pig 0.11.0 RANK function and generating ranks for every id in my data, but I need the ranking done in a particular way: I want the rank to reset and start from 1 for every new id. Is it possible to use the rank function directly for this? Any tips would be appreciated.

Data:

id, rating
X001, 9
X001, 9
X001, 8
X002, 9
X002, 7
X002, 6
X002, 5
X003, 8
X004, 8
X004, 7
X004, 7
X004, 4

On using the rank function like:

op = rank data by id, score;

I get this output:

rank, id, rating
1, X001, 9
1,
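
Pig's RANK cannot reset per group, so a common workaround (not from this entry; the DataFu jar path is an assumption) is to group by id, order each bag, and number its tuples with DataFu's Enumerate UDF, which appends the index as a new last field:

REGISTER /path/to/datafu.jar;  -- hypothetical location of the DataFu jar
DEFINE Enumerate datafu.pig.bags.Enumerate('1');  -- start numbering at 1

grouped = GROUP data BY id;
ranked  = FOREACH grouped {
              sorted = ORDER data BY rating DESC;
              GENERATE FLATTEN(Enumerate(sorted)) AS (id, rating, rank);
          };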
