apache-pig

Pig Latin split columns to rows

心已入冬 · Submitted on 2020-01-30 12:00:07

Question: Is there any solution in Pig Latin to transform columns to rows to get the below?

Input:

    id|column1|column2
    1|a,b,c|1,2,3
    2|d,e,f|4,5,6

Required output:

    id|column1|column2
    1|a|1
    1|b|2
    1|c|3
    2|d|4
    2|e|5
    2|f|6

Thanks.

Answer 1: I'm willing to bet this is not the best way to do this, however...

    data = load 'input' using PigStorage('|') as (id:chararray, col1:chararray, col2:chararray);
    A = foreach data generate id, flatten(TOKENIZE(col1));
    B = foreach data generate id, flatten(TOKENIZE(col2));
    RA = …
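The truncated answer above splits each comma-separated column and pairs the tokens positionally. The same columns-to-rows transformation can be sketched in Python (a hypothetical illustration of the logic, not the continuation of the original Pig answer):

```python
def split_columns_to_rows(rows):
    # Split each comma-separated column and pair tokens positionally,
    # mirroring the effect of flattening TOKENIZE(col1) and TOKENIZE(col2)
    # and joining the results on (id, position).
    out = []
    for id_, col1, col2 in rows:
        for a, b in zip(col1.split(","), col2.split(",")):
            out.append((id_, a, b))
    return out

rows = [("1", "a,b,c", "1,2,3"), ("2", "d,e,f", "4,5,6")]
for r in split_columns_to_rows(rows):
    print("|".join(r))  # 1|a|1, 1|b|2, ... matching the required output
```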

Apache PIG - How to cut digits after decimal point

一世执手 · Submitted on 2020-01-25 16:23:11

Question: Is there any way to cut off digits after the decimal point of a float or double number? For example: from 2.67894 I want to get 2.6 as the result (and not 2.7, as rounding would give).

Answer 1: Try this, where val holds your values like 2.666, 3.666, 4.666666, 5.3456334:

    b = foreach a GENERATE (FLOOR(val * 10) / 10);
    dump b;

Answer 2: Write a UDF (User Defined Function) for this. A very simple Python UDF (numformat.py):

    @outputSchema('value:double')
    def format(data):
        return round(data, 1)

(Of…
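The FLOOR(val * 10) / 10 trick from Answer 1 can be verified quickly in Python, since math.floor behaves like Pig's FLOOR here (note that flooring truncates toward negative infinity, so negative inputs behave differently from a plain digit cut):

```python
import math

def truncate_one_decimal(val):
    # Keep one digit after the decimal point without rounding up:
    # scale by 10, floor, scale back — same as Pig's FLOOR(val * 10) / 10.
    return math.floor(val * 10) / 10

print(truncate_one_decimal(2.67894))  # 2.6 (not 2.7)
```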

Looking up variable keys in pig map

核能气质少年 · Submitted on 2020-01-24 20:36:08

Question: I'm trying to use Pig to break text into lowercased words, and then look up each word in a map. Here's my example map, which I have in map.txt (it is only 1 line long):

    [this#1.9,is#2.5,my#3.3,vocabulary#4.1]

I load this like so:

    M = LOAD 'mapping.txt' USING PigStorage AS (mp: map[float]);

which works just fine. Then I do the following to load the text and break it into lowercased words:

    LINES = LOAD 'test.txt' USING TextLoader() AS (line:chararray);
    TOKENS = FOREACH LINES GENERATE FLATTEN…
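The pipeline the question is building, lowercase and tokenize each line, then look every token up in the loaded map, can be sketched in Python (a hypothetical stand-in for the Pig map lookup, not the accepted answer):

```python
# Word -> score map, matching the single-line map file in the question.
vocabulary = {"this": 1.9, "is": 2.5, "my": 3.3, "vocabulary": 4.1}

def score_words(line, mapping):
    # Lowercase and tokenize, then look each token up in the map;
    # None marks out-of-vocabulary words, like Pig's null on a missing key.
    tokens = line.lower().split()
    return [(t, mapping.get(t)) for t in tokens]

print(score_words("This is MY vocabulary", vocabulary))
```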

finding mean using pig or hadoop

自古美人都是妖i · Submitted on 2020-01-22 14:38:52

Question: I have a huge text file of data saved in a directory as data/data1.txt, data2.txt and so on:

    merchant_id, user_id, amount
    1234, 9123, 299.2
    1233, 9199, 203.2
    1234, 0124, 230

and so on. What I want to do is, for each merchant, find the average amount, and in the end save the output to a file, something like:

    merchant_id, average_amount
    1234, avg_amt_1234

and so on. How do I calculate the standard deviation as well? Sorry for asking such a basic question. :( Any help would…
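As a reference for what a GROUP-and-AVG job over this data should produce, here is a small Python sketch computing the per-merchant mean and (population) standard deviation (illustrative only; the field names come from the question's sample):

```python
import csv
import io
import math
from collections import defaultdict

def merchant_stats(lines):
    """Group amounts by merchant_id, then compute the mean and
    population standard deviation for each merchant."""
    amounts = defaultdict(list)
    reader = csv.DictReader(lines, skipinitialspace=True)
    for row in reader:
        amounts[row["merchant_id"]].append(float(row["amount"]))
    stats = {}
    for mid, vals in amounts.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[mid] = (mean, math.sqrt(var))
    return stats

data = """merchant_id, user_id, amount
1234, 9123, 299.2
1233, 9199, 203.2
1234, 0124, 230
"""
print(merchant_stats(io.StringIO(data)))
```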

Manipulate row data in hadoop to add missing columns

99封情书 · Submitted on 2020-01-17 04:14:29

Question: I have log files from IIS stored in HDFS, but due to webserver configuration some of the logs do not have all the columns, or the columns appear in a different order. I want to generate files that share a common schema so I can define a Hive table over them.

Example good log:

    #Fields: date time s-ip cs-method cs-uri-stem useragent
    2013-07-16 00:00:00 10.1.15.8 GET /common/viewFile/1232 Mozilla/5.0+Chrome/27.0.1453.116

Example log with missing columns (cs-method and useragent missing):

    #Fields: date time…
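The normalization being asked for, read each file's #Fields header, then re-emit every row under one common schema with "-" (the usual IIS null marker) for absent columns, can be sketched in Python (a hypothetical illustration, not the accepted answer):

```python
COMMON_SCHEMA = ["date", "time", "s-ip", "cs-method", "cs-uri-stem", "useragent"]

def normalize_log(lines, schema=COMMON_SCHEMA):
    """Re-emit rows under a fixed schema, using each log's #Fields
    header to map present columns and '-' for absent ones."""
    fields = []
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            fields = line.split()[1:]  # column names for subsequent rows
            continue
        values = dict(zip(fields, line.split()))
        out.append([values.get(col, "-") for col in schema])
    return out

log = [
    "#Fields: date time s-ip cs-uri-stem",
    "2013-07-16 00:00:00 10.1.15.8 /common/viewFile/1232",
]
print(normalize_log(log))
```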

embedded hadoop-pig: what's the correct way to use the automatic addContainingJar for UDFs?

空扰寡人 · Submitted on 2020-01-16 19:18:29

Question: When you use pigServer.registerFunction, you're not supposed to call pigServer.registerJar explicitly; instead, Pig automatically detects the jar using JarManager.findContainingJar. However, we have a complex UDF whose class depends on classes from multiple other jars, so we built a jar-with-dependencies with the maven-assembly plugin. But this causes the entire jar to end up in pigContext.skipJars (since it contains pig.jar itself) and it is never sent to the Hadoop cluster. :( What's the…

CqlStorage generates wrong Pig schema

最后都变了- · Submitted on 2020-01-16 04:50:07

Question: I'm loading some simple data from Cassandra into Pig using CqlStorage. The CqlStorage loader defines a schema based on the Cassandra schema, but that schema seems to be wrong. If I do:

    data = LOAD 'cql://bookdata/books' USING CqlStorage();
    DESCRIBE data;

I get this:

    data: {isbn: chararray,bookauthor: chararray,booktitle: chararray,publisher: chararray,yearofpublication: int}

However, if I DUMP data, I get results like these:

    ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in the…
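The DUMP output shows each row arriving as a tuple of (name, value) pairs rather than the bare values the DESCRIBEd schema promises. The shape of a workaround, collapsing those pairs back into named fields, can be sketched in Python (hypothetical illustration of the mismatch, not CqlStorage's actual behavior or fix):

```python
def pairs_to_row(row):
    # Each field arrives as a (column_name, value) pair;
    # rebuild a name -> value mapping so fields can be read by name,
    # the way the DESCRIBEd schema suggests they should be.
    return {name: value for name, value in row}

row = (("isbn", "0425093387"), ("bookauthor", "Georgette Heyer"))
print(pairs_to_row(row)["bookauthor"])  # Georgette Heyer
```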