apache-pig

strsplit issue - Pig

Posted by 故事扮演 on 2019-12-09 14:55:18

问题 I have the following tuple H1 and I want to STRSPLIT its $0 into a tuple. However, I always get an error message:

DUMP H1:
(item32;item31;,1)

m = FOREACH H1 GENERATE STRSPLIT($0, ";", 50);

ERROR 1000: Error during parsing. Lexical error at line 1, column 40. Encountered: after : "\";"

Does anyone know what's wrong with the script?

回答1: There is an escaping problem in Pig's parsing routines when they encounter this semicolon. You can use a Unicode escape sequence for a semicolon: \u003B . However this
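A minimal sketch of the suggested workaround, replacing the literal semicolon with its Unicode escape so the Pig parser no longer trips over it (assuming the same H1 relation as above):

```pig
-- split $0 on semicolons, written as the Unicode escape \u003B
m = FOREACH H1 GENERATE STRSPLIT($0, '\u003B', 50);
```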

Loading Raw JSON into Pig

Posted by 最后都变了- on 2019-12-09 13:21:08

问题 I have a file where each line is a JSON object (actually, it's a dump of Stack Overflow). I would like to load this into Apache Pig as easily as possible, but I am having trouble figuring out how to tell Pig what the input format is. Here's an example of an entry:

{ "_id" : { "$oid" : "506492073401d91fa7fdffbe" }, "Body" : "....", "ViewCount" : 7351, "LastEditorDisplayName" : "Rich B", "Title" : ".....", "LastEditorUserId" : 140328, "LastActivityDate" : { "$date" : 1314819738077 },
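One common approach for one-JSON-object-per-line input (not taken from the truncated entry above, so treat it as a sketch) is Twitter's elephant-bird JsonLoader, which loads each record as a Pig map; the jar name and input path here are assumptions:

```pig
REGISTER 'elephant-bird-pig.jar';  -- assumption: exact jar names vary by version

-- -nestedLoad parses nested JSON objects into nested maps
raw = LOAD 'stackoverflow.json'
      USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- each record is a map; pull fields out by key and cast as needed
titles = FOREACH raw GENERATE (chararray)$0#'Title' AS title,
                              (long)$0#'ViewCount' AS views;
```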

How do I get schema / column names from parquet file?

Posted by 余生长醉 on 2019-12-09 07:50:07

问题 I have a file stored in HDFS as part-m-00000.gz.parquet. I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet, but it's compressed, so I ran gunzip part-m-00000.gz.parquet. That doesn't uncompress the file, since gunzip doesn't recognise the .parquet extension. How do I get the schema / column names for this file?

回答1: You won't be able to "open" the file using hdfs dfs -text because it's not a text file. Parquet files are written to disk very differently compared to text files. And for the
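From within Pig itself, one way to see the schema (a sketch, assuming the parquet-pig bundle jar is available and registered) is to load the file with ParquetLoader and DESCRIBE the resulting relation:

```pig
REGISTER 'parquet-pig-bundle.jar';  -- assumption: exact jar name depends on your Parquet version

A = LOAD 'dir/part-m-00000.gz.parquet'
    USING org.apache.parquet.pig.ParquetLoader();

DESCRIBE A;  -- prints the schema read from the Parquet file footer
```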

Using Regex in Pig in Hadoop

Posted by ∥☆過路亽.° on 2019-12-09 03:54:26

问题 I have a CSV file containing user tweets (tweetid, tweet, userid):

396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“@BleacherReport: Halloween has given us this amazing Derrick Rose photo (via @amandakaschube, @ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03

Now I need to write a Pig query that returns
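Because the quoted tweet text can itself contain commas, a plain PigStorage(',') load would split these lines incorrectly. One sketch (the required output is an assumption, since the question is truncated) is to load each line whole and pull the three fields apart with REGEX_EXTRACT_ALL:

```pig
raw = LOAD 'tweets.csv' USING TextLoader() AS (line:chararray);

-- the pattern must match the whole line: digits, a quoted tweet, then the userid
parsed = FOREACH raw GENERATE FLATTEN(
             REGEX_EXTRACT_ALL(line, '^(\\d+),"(.*)",([^,]+)$')
         ) AS (tweetid:chararray, tweet:chararray, userid:chararray);
```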

How to prevent Apache pig from outputting empty files?

Posted by 被刻印的时光 ゝ on 2019-12-08 19:16:30

I have a Pig script that reads data from a directory on HDFS. The data are stored as Avro files. The directory structure looks like:

DIR
  --Subdir1
  --Subdir2
  --Subdir3
  --Subdir4

In the Pig script I am simply doing a load, filter and store. It looks like:

items = LOAD path USING AvroStorage();
items = FILTER items BY some property;
STORE items INTO outputDirectory USING AvroStorage();

The problem right now is that Pig is outputting many empty files in the output directory. I am wondering if there's a way to remove those files? Thanks!

For Pig version 0.13 and later, you can set pig.output.lazy=true to
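The truncated answer points at the lazy-output flag; as a sketch, it can be set at the top of the script (available in Pig 0.13 and later):

```pig
-- only create an output part file once there is at least one record to write
SET pig.output.lazy true;
```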

Projecting Grouped Tuples in Pig

Posted by 旧街凉风 on 2019-12-08 19:08:46

问题 I have a collection of tuples of the form (t,a,b) that I want to group by b in Pig. Once grouped, I want to filter b out of the tuples in each group and generate a bag of filtered tuples per group. As an example, assume we have

(1,2,1)
(2,0,1)
(3,4,2)
(4,1,2)
(5,2,3)

The Pig script should produce

{(1,2),(2,0)}
{(3,4),(4,1)}
{(5,2)}

The question is: how do I go about producing this result? I'm used to seeing examples where aggregation operations follow a GROUP BY operation. It's less clear to
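A sketch of one way to get this shape, using Pig's bag projection to drop b inside each group (the field names are assumptions taken from the (t,a,b) description):

```pig
data = LOAD 'input' USING PigStorage(',') AS (t:int, a:int, b:int);

grouped = GROUP data BY b;

-- project only (t, a) out of each group's bag, discarding the grouping key b
result = FOREACH grouped GENERATE data.(t, a);
```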

Java Pig Latin sentence translator using Queues

Posted by 坚强是说给别人听的谎言 on 2019-12-08 14:24:24

I am very new to Java and am trying to create a program that translates a sentence into Pig Latin: move the first letter of each word to the end, then append "y" if that first letter was a vowel and "ay" otherwise. I am required to use a queue for this. Currently my program just terminates, and I was wondering if anyone might be able to spot where I am going wrong, or where to head next. Thanks!

import MyQueue.QueueList;
import java.util.Scanner;

public class PigLatin {
    public static void main (String[] args) {
        Scanner scan = new Scanner (System.in);
        QueueList word = new

Pig programming: using SPLIT on GROUP BY with COUNT(*)

Posted by 我与影子孤独终老i on 2019-12-08 13:46:40

问题 The input file is:

2, cornflakes, Regular, General Mills, 12
3, cornflakes, Mixed Nuts, Post, 14
4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

filter3 = LOAD 'location_of_file' USING PigStorage('\t') AS (item_sl:int, item:chararray, type:chararray, manufacturer:chararray, price:int);
SPLIT filter3 INTO filter4 IF (FOREACH (filter3 GROUP BY item)
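A SPLIT condition cannot contain a nested FOREACH or GROUP. One common pattern (a sketch; the count threshold and relation names are assumptions) is to compute the per-item counts first, join them back onto the rows, and then SPLIT on the precomputed count:

```pig
-- count how many rows each item has
counts = FOREACH (GROUP filter3 BY item)
         GENERATE group AS item, COUNT(filter3) AS cnt;

-- attach the count to every original row
joined = JOIN filter3 BY item, counts BY item;

-- now SPLIT on a plain field instead of an inline aggregate
SPLIT joined INTO frequent IF cnt >= 2L, rare IF cnt < 2L;
```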

How to have Pig store rows in HBase as strings not bytes?

Posted by ⅰ亾dé卋堺 on 2019-12-08 13:41:40

问题 If I use the HBase shell and issue:

put 'test', 'rowkey1', 'cf:foo', 'bar'
scan 'test'

I will see the result as a string, not as bytes. If I use happybase and issue:

import happybase
connection = happybase.Connection('<hostname>')
table = connection.table('test')
table.put('rowkey2', {'cf:foo': 'bar'})
for row in table.scan():
    print row

I will see the result as a string, not as bytes. I have data in Hive that I ran an aggregation on and stored on HDFS via:

INSERT OVERWRITE DIRECTORY
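When Pig loads Hive-written output without declared types, the fields come through as bytearray and HBaseStorage writes their raw binary form. A sketch of declaring chararray types so the values land in HBase as UTF-8 strings (the path, table name, and delimiter here are assumptions):

```pig
-- Hive's default field delimiter is \u0001 (Ctrl-A)
data = LOAD '/path/to/hive/output' USING PigStorage('\u0001')
       AS (rowkey:chararray, foo:chararray);

-- chararray fields are stored as UTF-8 strings, matching the shell's behaviour
STORE data INTO 'hbase://test'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:foo');
```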

How can I do this inner join properly in Apache PIG?

Posted by 大憨熊 on 2019-12-08 12:22:55

问题 I have two files. One, called a-records:

123^record1
222^record2
333^record3

and the other, called b-records:

123^jim
123^jim
222^mike
333^joe

You can see that file A has the token 123 once, while file B has it twice. Is there a way, using Apache Pig, to join the data such that I only get ONE joined record per row of the A file? Here is my current script, which outputs the following below:

arecords = LOAD '$a' USING PigStorage('^') as (token:chararray, type:chararray);
brecords = LOAD
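One sketch that yields a single joined row per A record is to de-duplicate the B side with DISTINCT before joining (the b-records schema is an assumption, since the script above is truncated):

```pig
brecords = LOAD '$b' USING PigStorage('^') AS (token:chararray, name:chararray);

b_unique = DISTINCT brecords;   -- collapses the duplicate 123^jim rows

joined = JOIN arecords BY token, b_unique BY token;
```

Note that DISTINCT only removes rows that are duplicates in every field; if B can hold different values for the same token, group by token and pick one value per group instead.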