apache-pig

Hadoop - Load Hive tables using PIG

Submitted by 北城以北 on 2019-12-13 07:56:04
Question: I want to load Hive tables using Pig. I think we can do this through HCatLoader, but I am using XML files to load into Pig, and for this I have to use XMLLoader. Can I use both options to load XML files in Pig? I am extracting data from the XML files using my own UDF, and once all the data is extracted, I have to load the Pig data into Hive tables. I can't use Hive to extract the XML data, as the XML I receive is quite complex, so I wrote my own UDF to parse it. Any suggestions or pointers on how we can load …
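A common shape for this pipeline (a sketch, not the asker's actual script — the jar, UDF class, and Hive table names below are placeholders) is to parse the XML with the custom UDF and then write the result into the Hive table with HCatStorer:

```pig
-- Sketch: parse XML with a custom UDF, then store into a Hive table via HCatStorer.
-- myudfs.jar, mypkg.ParseXml, and mydb.mytable are hypothetical names.
REGISTER myudfs.jar;
raw    = LOAD '/data/input.xml'
         USING org.apache.pig.piggybank.storage.XMLLoader('record') AS (doc:chararray);
parsed = FOREACH raw GENERATE FLATTEN(mypkg.ParseXml(doc)) AS (id:int, name:chararray);
-- Run with: pig -useHCatalog script.pig
STORE parsed INTO 'mydb.mytable' USING org.apache.hive.hcatalog.pig.HCatStorer();
```

So XMLLoader handles the read side, the UDF does the extraction, and HCatStorer handles the write into Hive; the two loaders never have to be combined in a single LOAD statement.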

Pig ORDER command fails

Submitted by 风流意气都作罢 on 2019-12-13 07:45:47
Question: I am trying to analyze an Apache log, and the goal is to find out all user agents and their percentage of usage. The following program works fine up to the line where result contains each user agent, its count, and its percentage. The program fails at the last line, when it tries to order by most used. Could someone help? logs = LOAD '$LOGS' USING ApacheCombinedLogLoader AS (remoteHost, hyphen, user, time, method, uri, protocol, statusCode, responseSize, referer, userAgent); uarows = FOREACH logs …
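Without the full script it is hard to say exactly where the ORDER fails, but a frequent cause is ordering by a computed expression whose type Pig cannot infer. A hedged sketch of the count/percentage/order pipeline, with the ORDER applied to a typed, aliased field (all aliases here are illustrative):

```pig
-- Count per user agent, compute its share of the total, then sort descending.
grouped = GROUP uarows BY userAgent;
counts  = FOREACH grouped GENERATE group AS agent, COUNT(uarows) AS cnt;
total   = FOREACH (GROUP counts ALL) GENERATE SUM(counts.cnt) AS n;
result  = FOREACH counts GENERATE agent, cnt,
                 100.0 * (double)cnt / (double)total.n AS pct;
sorted  = ORDER result BY cnt DESC;   -- order by the named field, not an expression
```

Ordering by the plain field `cnt` (or `pct`) rather than an inline computation sidesteps the class of errors ORDER raises on untyped expressions.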

Pig: Create json file with actual key_name and values

Submitted by 折月煮酒 on 2019-12-13 07:29:30
Question: I have a Pig script using the Elephant Bird JSON loader. data_input = LOAD '$DATA_INPUT' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map []); x = FOREACH data_input GENERATE json#'user__id_str', json#'user__created_at', json#'user__notifications', json#'user__follow_request_sent', json#'user__friends_count', json#'user__name', json#'user__time_zone', json#'user__profile_background_color', json#'user__is_translation_enabled', json#'user__profile_link_color', json#'user__utc …
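If the goal is to write the extracted fields back out as JSON with the real key names, one approach is to give each projected value an alias and store with the built-in JsonStorage, since the aliases become the output keys. A sketch (only a few of the fields are shown, and `$DATA_OUTPUT` is a hypothetical parameter):

```pig
-- Aliases become the JSON keys in the stored output.
named = FOREACH data_input GENERATE
            json#'user__id_str'     AS user_id_str,
            json#'user__created_at' AS user_created_at,
            json#'user__name'       AS user_name;
STORE named INTO '$DATA_OUTPUT' USING JsonStorage();
```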

How to process a flat file with JSON string as a part of each line, into CSV file Using PIG Loader?

Submitted by 馋奶兔 on 2019-12-13 07:07:27
Question: I have a file in HDFS as 44,UK,{"names":{"name1":"John","name2":"marry","name3":"stuart"},"fruits":{"fruit1":"apple","fruit2":"orange"}},31-07-2016 91,INDIA,{"names":{"name1":"Ram","name2":"Sam"},"fruits":{}},31-07-2016 and want to store this into a CSV file as below using a Pig loader: 44,UK,names,name1,John,31-07-2016 44,UK,names,name2,Marry,31-07-2016 .. 44,UK,fruit,fruit1,apple,31-07-2016 .. 91,INDIA,names,name1,Ram,31-07-2016 .. 91,INDIA,null,null,Ram,31-07-2016 What should be the Pig …
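One possible approach (a sketch that assumes Elephant Bird is on the classpath; the regex, paths, and aliases are illustrative) is to split off the fixed leading and trailing CSV fields with a regex and hand the JSON blob to JsonStringToMap:

```pig
DEFINE JsonToMap com.twitter.elephantbird.pig.piggybank.JsonStringToMap();

raw   = LOAD '/data/input.txt' AS (line:chararray);
-- Leading id/country and trailing date are plain CSV; the middle is the JSON blob.
parts = FOREACH raw GENERATE
            FLATTEN(REGEX_EXTRACT_ALL(line, '^([^,]+),([^,]+),(\\{.*\\}),([^,]+)$'))
            AS (id:chararray, country:chararray, js:chararray, dt:chararray);
-- Flattening the map yields one (key, value) row per top-level JSON key;
-- the nested objects (names, fruits) would need a second pass or a custom UDF.
kv    = FOREACH parts GENERATE id, country, FLATTEN(JsonToMap(js)), dt;
```

The second level of nesting is the hard part: applying the same map-flattening once more on the inner JSON strings gets close to the row-per-leaf CSV shape shown above.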

Error when passing parameters to a Pig script

Submitted by 最后都变了- on 2019-12-13 06:38:32
Question: When I try to invoke a Pig script with a property file, I get an error: pig -P /mapr/ANALYTICS/apps/PigTest/pig.properties -f pig_if_condition.pig SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/mapr/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/mapr/hbase/hbase-0.98.4/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html …
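A point worth checking here: `-P` (alias `-propertyFile`) sets Pig configuration properties, not script parameters; values meant for `$NAME` substitution inside the script go through `-param` or `-param_file` instead. A sketch of the parameter-file route (file names and the INPUT parameter are illustrative):

```pig
-- params.txt would contain lines like:  INPUT=/mapr/ANALYTICS/data/in
-- and is passed with:  pig -param_file params.txt -f pig_if_condition.pig
%default INPUT '/tmp/in';
-- %default supplies a fallback when no parameter is passed on the command line.
data = LOAD '$INPUT' AS (line:chararray);
```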

pig join with java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

Submitted by £可爱£侵袭症+ on 2019-12-13 06:13:15
Question: I have two files. In data1: 1 3 1 2 5 1 In data2: 2 3 2 4 I then tried to read them into Pig: d1 = LOAD 'data1'; d2 = foreach d1 generate flatten(STRSPLIT($0, ' +')) as (f1:int,f2:int); d3 = LOAD 'data2' ; d4 = foreach d3 generate flatten(STRSPLIT($0, ' +')) as (f1:int,f2:int); data = join d2 by f1, d4 by f2; Then I got 2013-08-04 00:48:26,032 [Thread-21] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0005 java.lang.ClassCastException: java.lang.String cannot be cast to java.lang…
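The usual explanation for this exception is that STRSPLIT always produces chararrays, and the AS clause only renames fields without casting them, so the JOIN ends up comparing strings against the declared ints. An explicit cast in a separate FOREACH fixes it; a sketch:

```pig
d1  = LOAD 'data1' AS (line:chararray);
d2  = FOREACH d1 GENERATE FLATTEN(STRSPLIT(line, ' +')) AS (f1:chararray, f2:chararray);
d2c = FOREACH d2 GENERATE (int)f1 AS f1, (int)f2 AS f2;   -- a real cast, not a rename
d3  = LOAD 'data2' AS (line:chararray);
d4  = FOREACH d3 GENERATE FLATTEN(STRSPLIT(line, ' +')) AS (f1:chararray, f2:chararray);
d4c = FOREACH d4 GENERATE (int)f1 AS f1, (int)f2 AS f2;
data = JOIN d2c BY f1, d4c BY f2;
```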

REGEX_EXTRACT error in PIG

Submitted by 时光毁灭记忆、已成空白 on 2019-12-13 05:53:44
Question: I have a CSV file with 3 columns: tweetid, tweet, and Userid. However, within the tweet column there are comma-separated values, e.g. one row of data: `396124437168537600`,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143 I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code: a = LOAD 'tweets' USING PigStorage(',') AS (f1,f2,f3); b = FILTER a BY REGEX_EXTRACT(f1,'(…
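Because the tweet text is quoted and contains commas, loading with PigStorage(',') breaks the row apart before the regex ever runs. One workaround (a sketch; the pattern assumes the exact backtick-and-quote layout shown in the sample row) is to load each line whole and pull the three fields out with REGEX_EXTRACT:

```pig
a = LOAD 'tweets' AS (line:chararray);
-- Group 1: the backtick-wrapped id; group 2: the quoted tweet; group 3: the user.
b = FOREACH a GENERATE
        REGEX_EXTRACT(line, '^`([0-9]+)`,"(.*)",(.*)$', 1) AS tweetid,
        REGEX_EXTRACT(line, '^`([0-9]+)`,"(.*)",(.*)$', 2) AS tweet,
        REGEX_EXTRACT(line, '^`([0-9]+)`,"(.*)",(.*)$', 3) AS userid;
```

An alternative worth noting is Piggybank's CSVExcelStorage, which understands quoted fields directly.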

Loading unstructured data with different delimiters in Pig using Pig Latin only

Submitted by 我的梦境 on 2019-12-13 05:49:20
Question: Hi, I am trying to load the following data (which includes different delimiters and is unstructured) into Pig using Pig Latin only, without preparing the data with e.g. Java. Input: 1234 #one,#two,#three 5679 #one,#two 1234 #one Output I am looking for: 1234 #one 1234 #two 1234 #three 5678 #one 5678 #two 1234 #one Any ideas? Is this even possible in Pig? Thanks a lot in advance! Answer 1: Pig Script: A = LOAD 'a.csv' USING PigStorage(' ') AS (key:chararray, value:chararray); B = FOREACH A GENERATE …
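A complete version of that answer's idea splits on the space first, then bursts the comma-separated tag list into rows. A sketch (note that the default TOKENIZE already treats a comma as a delimiter, so no custom delimiter argument is needed):

```pig
A = LOAD 'a.csv' USING PigStorage(' ') AS (key:chararray, value:chararray);
-- TOKENIZE returns a bag of tokens, so FLATTEN yields one (key, tag) row per tag.
B = FOREACH A GENERATE key, FLATTEN(TOKENIZE(value)) AS tag;
DUMP B;
```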

Loading datetime format files using PIG

Submitted by 天涯浪子 on 2019-12-13 05:34:17
Question: I have a dataset in the following format. ravi,savings,avinash,2,char,33,F,22,44,12,13,33,44,22,11,10,22,2006-01-23 avinash,current,sandeep,3,char,44,M,33,11,10,12,33,22,39,12,23,19,2001-02-12 supreeth,savings,prabhash,4,char,55,F,22,12,23,12,44,56,7,88,34,23,1995-03-11 lavi,current,nirmesh,5,char,33,M,11,10,33,34,56,78,54,23,445,66,1999-06-15 Venkat,savings,bunny,6,char,11,F,99,12,34,55,33,23,45,66,23,23,2016-05-18 The last column (example: 2006-01-23) is a date. I am trying to load the above data …
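One way to handle the date column (a sketch; only a few of the many columns are spelled out, and the rest would follow the same pattern) is to load it as chararray and convert it with the built-in ToDate, giving a proper datetime field:

```pig
raw   = LOAD '/data/accounts.csv' USING PigStorage(',')
        AS (name:chararray, acct_type:chararray, nominee:chararray, dt:chararray);
-- ToDate with an explicit pattern parses the ISO-style date into a datetime.
typed = FOREACH raw GENERATE name, acct_type, nominee,
                             ToDate(dt, 'yyyy-MM-dd') AS dt;
```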

Use Pig to Denormalize A Large Data Frame

Submitted by 时光怂恿深爱的人放手 on 2019-12-13 05:08:15
Question: I have a large-ish (21 GB) tab-delimited data frame of the form DOCID_1 TERMID_1 TITLE_1 YEAR_1 AUTHOR_1 DOCID_1 TERMID_2 TITLE_1 YEAR_1 AUTHOR_1 ... DOCID_n TERMID_n TITLE_n YEAR_n AUTHOR_n That is, a (DOCID, TERMID) pair will always uniquely identify a row. What I need is a data frame in which a DOCID alone uniquely identifies a row, and the TERMIDs are collapsed into a comma-separated chararray list. For example, DOCID_1 TERMID_11, TERMID_12, ..., TERMID_n TITLE_1 YEAR_1 AUTHOR_1 ... DOCID …
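Grouping on DOCID together with its per-document columns and then collapsing the bag of TERMIDs with the built-in BagToString is the standard Pig move for this. A sketch (the path and aliases are illustrative):

```pig
rows = LOAD '/data/docs.tsv' USING PigStorage('\t')
       AS (docid:chararray, termid:chararray, title:chararray, year:int, author:chararray);
-- Title/year/author are functionally dependent on docid, so grouping on all
-- four collapses each document to a single row.
g    = GROUP rows BY (docid, title, year, author);
out  = FOREACH g GENERATE FLATTEN(group) AS (docid, title, year, author),
                          BagToString(rows.termid, ',') AS termids;
```

Because GROUP is a single MapReduce shuffle, this scales to the 21 GB input without pulling any document's rows into one mapper.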