apache-pig

Pig Latin split columns to rows

心已入冬 · Submitted on 2020-01-30 12:00:07

Question: Is there any solution in Pig Latin to transform columns to rows to get the below?

Input:

    id|column1|column2
    1|a,b,c|1,2,3
    2|d,e,f|4,5,6

Required output:

    id|column1|column2
    1|a|1
    1|b|2
    1|c|3
    2|d|4
    2|e|5
    2|f|6

Thanks.

Answer 1: I'm willing to bet this is not the best way to do this, however...

    data = load 'input' using PigStorage('|') as (id:chararray, col1:chararray, col2:chararray);
    A = foreach data generate id, flatten(TOKENIZE(col1));
    B = foreach data generate id, flatten(TOKENIZE(col2));
    RA = …
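The truncated answer above splits each comma-separated column and pairs the tokens positionally. The same columns-to-rows transformation can be sketched in Python (a hypothetical illustration of the logic, not the continuation of the original Pig answer):

```python
def split_columns_to_rows(rows):
    # Split each comma-separated column and pair tokens positionally,
    # mirroring the effect of flattening TOKENIZE(col1) and TOKENIZE(col2)
    # and joining the results on (id, position).
    out = []
    for id_, col1, col2 in rows:
        for a, b in zip(col1.split(","), col2.split(",")):
            out.append((id_, a, b))
    return out

rows = [("1", "a,b,c", "1,2,3"), ("2", "d,e,f", "4,5,6")]
for r in split_columns_to_rows(rows):
    print("|".join(r))  # 1|a|1, 1|b|2, ... matching the required output
```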

Apache PIG - How to cut digits after decimal point

一世执手 · Submitted on 2020-01-25 16:23:11

Question: Is there any way to cut off digits after the decimal point of a float or double number? For example: from 2.67894 I want to get 2.6 as the result (and not 2.7, as rounding would give).

Answer 1: Try this, where val holds your values like 2.666, 3.666, 4.666666, 5.3456334:

    b = foreach a GENERATE (FLOOR(val * 10) / 10);
    dump b;

Answer 2: Write a UDF (User Defined Function) for this. A very simple Python UDF (numformat.py):

    @outputSchema('value:double')
    def format(data):
        return round(data, 1)

(Of…
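The FLOOR(val * 10) / 10 trick from Answer 1 can be verified quickly in Python, since math.floor behaves like Pig's FLOOR here (note that flooring truncates toward negative infinity, so negative inputs behave differently from a plain digit cut):

```python
import math

def truncate_one_decimal(val):
    # Keep one digit after the decimal point without rounding up:
    # scale by 10, floor, scale back — same as Pig's FLOOR(val * 10) / 10.
    return math.floor(val * 10) / 10

print(truncate_one_decimal(2.67894))  # 2.6 (not 2.7)
```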

Looking up variable keys in pig map

核能气质少年 · Submitted on 2020-01-24 20:36:08

Question: I'm trying to use Pig to break text into lowercased words, and then look up each word in a map. Here's my example map, which I have in map.txt (it is only 1 line long):

    [this#1.9,is#2.5,my#3.3,vocabulary#4.1]

I load this like so:

    M = LOAD 'mapping.txt' USING PigStorage AS (mp: map[float]);

which works just fine. Then I do the following to load the text and break it into lowercased words:

    LINES = LOAD 'test.txt' USING TextLoader() AS (line:chararray);
    TOKENS = FOREACH LINES GENERATE FLATTEN…
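The pipeline the question is building, lowercase and tokenize each line, then look every token up in the loaded map, can be sketched in Python (a hypothetical stand-in for the Pig map lookup, not the accepted answer):

```python
# Word -> score map, matching the single-line map file in the question.
vocabulary = {"this": 1.9, "is": 2.5, "my": 3.3, "vocabulary": 4.1}

def score_words(line, mapping):
    # Lowercase and tokenize, then look each token up in the map;
    # None marks out-of-vocabulary words, like Pig's null on a missing key.
    tokens = line.lower().split()
    return [(t, mapping.get(t)) for t in tokens]

print(score_words("This is MY vocabulary", vocabulary))
```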

finding mean using pig or hadoop

自古美人都是妖i · Submitted on 2020-01-22 14:38:52

Question: I have a huge text file of data saved in a directory as data/data1.txt, data2.txt and so on:

    merchant_id, user_id, amount
    1234, 9123, 299.2
    1233, 9199, 203.2
    1234, 0124, 230

and so on. What I want to do is, for each merchant, find the average amount, and in the end save the output to a file, something like:

    merchant_id, average_amount
    1234, avg_amt_1234

and so on. How do I calculate the standard deviation as well? Sorry for asking such a basic question. :( Any help would…
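As a reference for what a GROUP-and-AVG job over this data should produce, here is a small Python sketch computing the per-merchant mean and (population) standard deviation (illustrative only; the field names come from the question's sample):

```python
import csv
import io
import math
from collections import defaultdict

def merchant_stats(lines):
    """Group amounts by merchant_id, then compute the mean and
    population standard deviation for each merchant."""
    amounts = defaultdict(list)
    reader = csv.DictReader(lines, skipinitialspace=True)
    for row in reader:
        amounts[row["merchant_id"]].append(float(row["amount"]))
    stats = {}
    for mid, vals in amounts.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[mid] = (mean, math.sqrt(var))
    return stats

data = """merchant_id, user_id, amount
1234, 9123, 299.2
1233, 9199, 203.2
1234, 0124, 230
"""
print(merchant_stats(io.StringIO(data)))
```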

Manipulate row data in hadoop to add missing columns

99封情书 · Submitted on 2020-01-17 04:14:29

Question: I have log files from IIS stored in HDFS, but due to webserver configuration some of the logs do not have all the columns, or the columns appear in a different order. I want to generate files that share a common schema so I can define a Hive table over them.

Example good log:

    #Fields: date time s-ip cs-method cs-uri-stem useragent
    2013-07-16 00:00:00 10.1.15.8 GET /common/viewFile/1232 Mozilla/5.0+Chrome/27.0.1453.116

Example log with missing columns (cs-method and useragent missing):

    #Fields: date time…
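The normalization being asked for, read each file's #Fields header, then re-emit every row under one common schema with "-" (the usual IIS null marker) for absent columns, can be sketched in Python (a hypothetical illustration, not the accepted answer):

```python
COMMON_SCHEMA = ["date", "time", "s-ip", "cs-method", "cs-uri-stem", "useragent"]

def normalize_log(lines, schema=COMMON_SCHEMA):
    """Re-emit rows under a fixed schema, using each log's #Fields
    header to map present columns and '-' for absent ones."""
    fields = []
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            fields = line.split()[1:]  # column names for subsequent rows
            continue
        values = dict(zip(fields, line.split()))
        out.append([values.get(col, "-") for col in schema])
    return out

log = [
    "#Fields: date time s-ip cs-uri-stem",
    "2013-07-16 00:00:00 10.1.15.8 /common/viewFile/1232",
]
print(normalize_log(log))
```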

embedded hadoop-pig: what's the correct way to use the automatic addContainingJar for UDFs?

空扰寡人 · Submitted on 2020-01-16 19:18:29

Question: When you use pigServer.registerFunction, you're not supposed to call pigServer.registerJar explicitly; instead, Pig automatically detects the jar using JarManager.findContainingJar. However, we have a complex UDF whose class depends on classes from multiple other jars, so we built a jar-with-dependencies with the maven-assembly plugin. But this causes the entire jar to end up in pigContext.skipJars (since it contains pig.jar itself) and it is never sent to the Hadoop cluster. :( What's the…

CqlStorage generates wrong Pig schema

最后都变了- · Submitted on 2020-01-16 04:50:07

Question: I'm loading some simple data from Cassandra into Pig using CqlStorage. The CqlStorage loader defines a schema based on the Cassandra schema, but that schema seems to be wrong. If I do:

    data = LOAD 'cql://bookdata/books' USING CqlStorage();
    DESCRIBE data;

I get this:

    data: {isbn: chararray,bookauthor: chararray,booktitle: chararray,publisher: chararray,yearofpublication: int}

However, if I DUMP data, I get results like these:

    ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in the…
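The DUMP output shows each row arriving as a tuple of (name, value) pairs rather than the bare values the DESCRIBEd schema promises. The shape of a workaround, collapsing those pairs back into named fields, can be sketched in Python (hypothetical illustration of the mismatch, not CqlStorage's actual behavior or fix):

```python
def pairs_to_row(row):
    # Each field arrives as a (column_name, value) pair;
    # rebuild a name -> value mapping so fields can be read by name,
    # the way the DESCRIBEd schema suggests they should be.
    return {name: value for name, value in row}

row = (("isbn", "0425093387"), ("bookauthor", "Georgette Heyer"))
print(pairs_to_row(row)["bookauthor"])  # Georgette Heyer
```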