apache-pig

STORE output to a single CSV?

☆樱花仙子☆ submitted on 2019-12-19 05:21:24
Question: Currently, when I STORE into HDFS, it creates many part files. Is there any way to store out to a single CSV file? Answer 1: You can do this in a few ways: To set the number of reducers for all Pig operations, you can use the default_parallel property - but this means every single step will use a single reducer, decreasing throughput: set default_parallel 1; Prior to calling STORE, if one of the operations executed is COGROUP, CROSS, DISTINCT, GROUP, JOIN (inner), JOIN (outer), or ORDER BY, then…
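
A minimal sketch of the second approach, assuming the step just before STORE is an ORDER BY (the file, field, and relation names are hypothetical):

    -- force one reducer only on the final job, so STORE emits a single part file
    data    = LOAD 'input.csv' USING PigStorage(',') AS (id:chararray, score:int);
    ordered = ORDER data BY score PARALLEL 1;
    STORE ordered INTO 'output_dir' USING PigStorage(',');

Alternatively, leave the parallelism alone and merge the part files afterwards with hadoop fs -getmerge output_dir local.csv.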

Running Pig query over data stored in Hive

*爱你&永不变心* submitted on 2019-12-18 12:06:01
Question: I would like to know how to run Pig queries over data stored in Hive format. I have configured Hive to store compressed data (using this tutorial: http://wiki.apache.org/hadoop/Hive/CompressedStorage). Before that, I just used the normal Pig load function with Hive's delimiter (^A). But now Hive stores data in sequence files with compression. Which load function should I use? Note that I don't need close integration like that mentioned here: Using Hive with Pig, just which load function to use to read compressed…
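
One common route, sketched here under the assumption that HCatalog is installed, is to read the table through Hive's metastore, which hides the storage details (sequence files, compression codec) from the script entirely. The table name below is hypothetical, and on older distributions the loader class lives under org.apache.hcatalog.pig instead:

    -- launch the script with: pig -useHCatalog
    A = LOAD 'default.my_compressed_table' USING org.apache.hive.hcatalog.pig.HCatLoader();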

Error in pig while loading data

让人想犯罪 __ submitted on 2019-12-18 11:31:28
Question: I am using Ubuntu 12.02 32-bit and have installed Hadoop 2.2.0 and Pig 0.12 successfully. Hadoop runs properly on my system. However, whenever I run this command: data = load 'atoz.csv' using PigStorage(',') as (aa1:int, bb1:int, cc1:int, dd1:chararray); dump data; I'm getting the following error: ERROR org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl - Error while trying to run jobs. java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class…

Load File delimited by double colon :: in pig

风流意气都作罢 submitted on 2019-12-18 07:23:26
Question: Following is a sample dataset delimited by a double colon (::): 1::Toy Story (1995)::Animation|Children's|Comedy I want to extract three fields from the above dataset as MovieID, title, and genre. I have written the following code for that: movies = LOAD 'location/of/dataset/on/hdfs' using PigStorage('::') as (MovieID:int,title:chararray,genre:chararray); But I am getting the following error: ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: <file script.pig, line 1, column 9>…
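
The parse error arises because PigStorage traditionally accepts only a single-character delimiter, so '::' is rejected. A minimal workaround sketch, loading each line whole and splitting it with the built-in REGEX_EXTRACT_ALL (the path and field types are illustrative):

    lines  = LOAD 'location/of/dataset/on/hdfs' AS (line:chararray);
    movies = FOREACH lines GENERATE FLATTEN(
                 REGEX_EXTRACT_ALL(line, '(.*?)::(.*?)::(.*)'))
             AS (MovieID:chararray, title:chararray, genre:chararray);
    -- cast afterwards if an int is needed, e.g. (int)MovieID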

CDH4 Hbase using Pig ERROR 2998 java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/Filter

China☆狼群 submitted on 2019-12-18 06:17:19
Question: I am using CDH4 in pseudo-distributed mode and I have some trouble working with HBase and Pig together (but both work fine alone). I am following this nice tutorial step by step: http://blog.whitepages.com/2011/10/27/hbase-storage-and-pig/ So my Pig script looks like this: register /usr/lib/zookeeper/zookeeper-3.4.3-cdh4.1.2.jar register /usr/lib/hbase/hbase-0.92.1-cdh4.1.2-security.jar register /usr/lib/hbase/lib/guava-11.0.2.jar raw_data = LOAD 'input.csv' USING PigStorage( ',' ) AS (…
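
ERROR 2998 with a NoClassDefFoundError for an HBase class usually means the HBase jars are reachable from the script's register statements but missing from Pig's own launch classpath; adding them to PIG_CLASSPATH before starting Pig is the usual remedy. For reference, a sketch of the storing step this tutorial builds toward, with a hypothetical table name and column mapping:

    -- the first field of raw_data becomes the HBase row key; the rest map in order
    STORE raw_data INTO 'hbase://mytable'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:col1 info:col2');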

Run pig in java without embedding pig script

早过忘川 submitted on 2019-12-18 05:13:44
Question: I am new to Pig scripting, Hadoop, and HBase. Here's what I need to know: I want to run a Pig script, but I don't want to embed it in my Java program; I want to run it through some Pig execution method, passing the necessary Pig script and parameters (possibly a parameter file). Does the core Pig library, or any other library, provide a way to execute a Pig script? I already tried the Java Runtime exec method; I pass some parameters with space-separated strings, so I dropped calling pig…
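
The core Pig library does provide this: org.apache.pig.PigServer and org.apache.pig.PigRunner can execute an external script file, so nothing needs to be embedded, and the script itself can take parameters through Pig's parameter substitution. A minimal sketch with hypothetical names:

    -- script.pig: $input and $output are bound at launch time, e.g.
    --   pig -param input=/data/in -param output=/data/out script.pig
    -- or from a file of key=value pairs: pig -param_file params.txt script.pig
    data = LOAD '$input' USING PigStorage(',');
    STORE data INTO '$output' USING PigStorage(',');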

How can I add row numbers for rows in PIG or HIVE?

六月ゝ 毕业季﹏ submitted on 2019-12-18 04:20:12
Question: I have a problem adding row numbers using Apache Pig. I have a STR_ID column and I want to add a ROW_NUM column holding the row number of each STR_ID. For example, here is the input:

STR_ID
------------
3D64B18BC842
BAECEFA8EFB6
346B13E4E240
6D8A9D0249B4
9FD024AA52BA

How do I get output like:

STR_ID       | ROW_NUM
----------------------
3D64B18BC842 | 1
BAECEFA8EFB6 | 2
346B13E4E240 | 3
6D8A9D0249B4 | 4
9FD024AA52BA | 5

Answers using Pig…
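
Since Pig 0.11 the RANK operator does exactly this. A minimal sketch, assuming the IDs sit in a one-column file (the path is hypothetical):

    str_ids  = LOAD 'str_ids.txt' AS (STR_ID:chararray);
    numbered = RANK str_ids;   -- prepends a 1-based long column named rank_str_ids
    DUMP numbered;             -- (1,3D64B18BC842), (2,BAECEFA8EFB6), ...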

Export from pig to CSV

最后都变了- submitted on 2019-12-18 02:50:15
Question: I'm having a lot of trouble getting data out of Pig and into a CSV that I can use in Excel or SQL (or R or SPSS, etc.) without a lot of manipulation. I've tried using the following function: STORE pig_object INTO '/Users/Name/Folder/pig_object.csv' USING CSVExcelStorage(',','NO_MULTILINE','WINDOWS'); It creates a folder with that name containing lots of part-m-0000# files. I can later join them all up using cat part* > filename.csv, but there's no header, which means I have to put it in…
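
Recent piggybank builds of CSVExcelStorage accept an optional fourth argument that writes a header row; combined with a single reducer this yields one ready-to-open file. A sketch under those assumptions (the sort key is hypothetical and only serves to force one part file):

    -- requires piggybank.jar to be registered
    single = ORDER pig_object BY some_field PARALLEL 1;
    STORE single INTO '/Users/Name/Folder/pig_object'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage(
            ',', 'NO_MULTILINE', 'WINDOWS', 'WRITE_OUTPUT_HEADER');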

Computing median in map reduce

若如初见. submitted on 2019-12-17 23:43:04
Question: Can someone explain the computation of the median/quantiles in MapReduce? My understanding of DataFu's median is that the n mappers sort the data and send it to a single reducer, which is responsible for sorting all the data from the n mappers and finding the median (middle value). Is my understanding correct? If so, does this approach scale for massive amounts of data? I can clearly see that single reducer struggling to do the final task. Thanks. Answer 1: Trying to find the median (middle number)…
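
That is essentially how the exact computation works, which is why DataFu also ships streaming (approximate) variants such as StreamingMedian that avoid the single sorted reducer. A sketch of the exact version, with a hypothetical jar path and input file (DataFu's Median expects its input bag already sorted):

    REGISTER /path/to/datafu.jar;
    DEFINE Median datafu.pig.stats.Median();
    vals    = LOAD 'numbers.txt' AS (val:double);
    grouped = GROUP vals ALL;             -- funnels everything to one reducer
    med     = FOREACH grouped {
                  sorted = ORDER vals BY val;
                  GENERATE Median(sorted.val);
              };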