apache-pig

pig skewed join with a big table causes “Split metadata size exceeded 10000000”

给你一囗甜甜゛ submitted on 2020-01-03 13:08:22
Question: We have a Pig join between a small (16M rows) distinct table and a big (6B rows) skewed table. A regular join finishes in 2 hours (after some tweaking). We tried using a skewed join and were able to improve the performance to 20 minutes. HOWEVER, when we try a bigger skewed table (19B rows), we get this message from the SAMPLER job: Split metadata size exceeded 10000000. Aborting job job_201305151351_21573 [ScriptRunner] at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo
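The 10000000-byte cap the sampler hits is a configurable Hadoop limit on split metadata, and it can be raised (or disabled with -1) from inside the Pig script. A minimal sketch, assuming a Hadoop 2 cluster where the property is `mapreduce.job.split.metainfo.maxsize` (on Hadoop 1 lines it is `mapreduce.jobtracker.split.metainfo.maxsize`); the table paths and field names are placeholders:

```pig
-- Raise the 10 MB split-metadata cap before running the skewed join.
-- Property name assumed for Hadoop 2; older clusters use
-- mapreduce.jobtracker.split.metainfo.maxsize instead. Use -1 to disable.
SET mapreduce.job.split.metainfo.maxsize 100000000;

big   = LOAD 'big_table'   USING PigStorage('\t') AS (k:chararray, v:chararray);
small = LOAD 'small_table' USING PigStorage('\t') AS (k:chararray, w:chararray);
J = JOIN big BY k, small BY k USING 'skewed';
```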

In Apache Pig how can I serialise columns into rows?

久未见 submitted on 2020-01-03 05:27:08
Question: In Apache Pig I want to serialise columns held in a relation into rows. More specifically, the data loaded into the relation look (via DUMP) like

(val1a, val2a, ....)
(val1b, val2b, val3b, ....)
(val1c, val2c, ....)
...

and I want to transform this into

(val1a)
(val2a)
...
(val1b)
(val2b)
(val3b)
...
(val1c)
(val2c)
...

So each column has to be "serialised" into rows, and these rows are then appended one after another. Please note: I do not necessarily know how many columns are in each row.
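One common way to do this without knowing the column count in advance is to pack each tuple's fields into a bag and flatten it, so every field becomes its own output row. A sketch, assuming the input is comma-separated and loaded without a schema (the relation names and path are ours):

```pig
-- Load rows of unknown width; with no AS clause each tuple keeps
-- however many fields its line actually had.
A = LOAD 'input' USING PigStorage(',');
-- TOBAG(*) turns all fields of the tuple into elements of one bag;
-- FLATTEN then emits one output row per bag element.
B = FOREACH A GENERATE FLATTEN(TOBAG(*));
DUMP B;
```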

Pig error 1070 when doing UDF

做~自己de王妃 submitted on 2020-01-03 05:09:05
Question: I am trying to load my own UDF in Pig. I have made it into a jar using Eclipse's export function. I am trying to run it locally so I can make sure it works before I put the jar on HDFS. When running it locally, I get the following error: ERROR 1070: Could not resolve myudfs.MONTH using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.] Script: REGISTER myudfs.jar; --DEFINE MONTH myudfs.MONTH; A = load 'access_log_Jul95' using PigStorage(' ') as (ip:chararray, dash1:chararray
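ERROR 1070 generally means Pig could not find a class under the name used in the script, usually because the jar was registered from the wrong path or the package/class name in the script does not match what the jar actually contains. A hedged sketch, assuming the compiled class really is `myudfs.MONTH` inside `myudfs.jar` and the jar path is a placeholder:

```pig
-- Register the jar with an explicit path so local mode can find it.
REGISTER /home/user/myudfs.jar;
-- The fully qualified name must match the class inside the jar;
-- verify with: jar tf myudfs.jar
DEFINE MONTH myudfs.MONTH();
A = LOAD 'access_log_Jul95' USING PigStorage(' ')
    AS (ip:chararray, dash1:chararray);
B = FOREACH A GENERATE MONTH(ip);
```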

ERROR 2998: Unhandled internal error. Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected

扶醉桌前 submitted on 2020-01-02 12:25:15
Question: I am new to Hadoop. I was trying to integrate Pig with Hive using HCatalog but am getting the below error during dump. Please let me know if any of you can help me out: A = load 'logs' using org.apache.hcatalog.pig.HCatLoader(); dump A ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected. load and describe work fine, but dump gives the above error. Details: hadoop-2.6.0, pig-0.14.0, hive-0.12.0
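A "Found interface ... but class was expected" error typically means some jar on the classpath was compiled against Hadoop 1 (where JobContext was a class) but is running on Hadoop 2 (where it is an interface). One common remedy, sketched here under the assumption that a Pig source tree is available, is to rebuild Pig for the Hadoop 2 APIs and use HCatalog jars built for the same Hadoop line:

```shell
# Rebuild Pig against the Hadoop 2 / YARN APIs
# (hadoopversion=23 is the flag Pig's ant build uses for Hadoop 2.x).
cd pig-0.14.0
ant clean jar -Dhadoopversion=23
```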

Sort related bag

流过昼夜 submitted on 2020-01-02 04:41:07
Question: I have a Pig script which generated a relation A: {x: chararray,B: {(y: chararray,z: int)}} I want to sort A based on B.y, however the following piece gives me an error: Syntax error, unexpected symbol at or near z output = foreach A{ sorted = order B by z DSC; generate x,sorted; } Answer 1: Use DESC instead of DSC. e.g. output = foreach A{ sorted = order B by z DESC; generate x,sorted; } Source: https://stackoverflow.com/questions/15144194/sort-related-bag

select count distinct using pig latin

Deadly submitted on 2020-01-01 04:20:13
Question: I need help with this Pig script. I am only getting a single record back. I am selecting 2 columns and doing a count(distinct) on another, while also using a WHERE-like clause to find a particular description (desc). Here is the SQL I am trying to express in Pig: /* For example in sql: select domain, count(distinct(segment)) as segment_cnt from table where desc='ABC123' group by domain order by segment_count desc; */ A = LOAD 'myoutputfile' USING PigStorage('\u0005') AS ( domain:chararray, segment
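SQL's per-group COUNT(DISTINCT ...) maps onto a nested FOREACH in Pig: filter, group, take the distinct values of the inner bag, then count them. A sketch assuming the column names from the question's SQL (with `desc` renamed to `descr`, since DESC is a Pig keyword):

```pig
A = LOAD 'myoutputfile' USING PigStorage('\u0005')
    AS (domain:chararray, segment:chararray, descr:chararray);
-- WHERE desc = 'ABC123'
B = FILTER A BY descr == 'ABC123';
-- GROUP BY domain ...
C = GROUP B BY domain;
-- ... then COUNT(DISTINCT segment) inside each group
D = FOREACH C {
        segs = DISTINCT B.segment;
        GENERATE group AS domain, COUNT(segs) AS segment_cnt;
    };
-- ORDER BY segment_cnt DESC
E = ORDER D BY segment_cnt DESC;
```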

Filter a string on the basis of a word

冷暖自知 submitted on 2020-01-01 03:11:51
Question: I have a Pig job in which I need to filter the data by finding a word in it. Here is the snippet: A = LOAD '/home/user/filename' USING PigStorage(','); B = FOREACH A GENERATE $27,$38; C = FILTER B BY ( $1 == '*Word*'); STORE C INTO '/home/user/out1' USING PigStorage(); The error is in the 3rd line, computing C. I have also tried C = FILTER B BY $1 MATCHES '*WORD*' and C = FILTER B BY $1 MATCHES '\\w+WORD\\w+' Answer 1: MATCHES uses regular expressions. You should do ... MATCHES '.*WORD.*'
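Putting the answer's regex back into the original snippet: the glob-style `*` becomes the regex `.*`, and MATCHES must match the entire field, hence the leading and trailing `.*`:

```pig
A = LOAD '/home/user/filename' USING PigStorage(',');
B = FOREACH A GENERATE $27, $38;
-- MATCHES takes a Java regular expression that must cover the whole
-- string, so the word is wrapped in .* on both sides.
C = FILTER B BY $1 MATCHES '.*WORD.*';
STORE C INTO '/home/user/out1' USING PigStorage();
```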

Loading JSON file with serde in Cloudera

百般思念 submitted on 2019-12-31 07:18:15
Question: I am trying to work with a JSON file with this bag structure: { "user_id": "kim95", "type": "Book", "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", "year": "1995", "publisher": "ACM Press and Addison-Wesley", "authors": [ { "name": "null" } ], "source": "DBLP" } { "user_id": "marshallo79", "type": "Book", "title": "Inequalities: Theory of Majorization and Its Application.", "year": "1979", "publisher": "Academic Press", "authors": [ { "name": "Albert W.
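As an alternative to a Hive SerDe, records like these can be read from Pig itself with the built-in JsonLoader (available since Pig 0.10) by supplying a schema. A sketch assuming one JSON object per line and the field names shown above; the file name is a placeholder:

```pig
-- Built-in JsonLoader: the schema must mirror the JSON layout,
-- with authors declared as a bag of (name) tuples.
A = LOAD 'books.json' USING JsonLoader(
        'user_id:chararray, type:chararray, title:chararray,
         year:chararray, publisher:chararray,
         authors:{(name:chararray)}, source:chararray');
-- Flatten the authors bag to get one row per (user, title, author name).
B = FOREACH A GENERATE user_id, title, FLATTEN(authors.name);
```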