apache-pig

strsplit issue - Pig

Posted by 故事扮演 on 2019-12-09 14:55:18

问题 I have the following tuple H1 and I want to STRSPLIT its $0 into a tuple. However, I always get an error message:

DUMP H1:
(item32;item31;,1)

m = FOREACH H1 GENERATE STRSPLIT($0, ";", 50);

ERROR 1000: Error during parsing. Lexical error at line 1, column 40. Encountered: after : "\";"

Does anyone know what's wrong with the script?

回答1: There is an escaping problem in Pig's parsing routines when they encounter this semicolon. You can use a Unicode escape sequence for a semicolon: \u003B . However this
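A minimal sketch of the suggested workaround, replacing the literal semicolon with its Unicode escape so the Pig parser no longer trips over it (assuming the same H1 relation as above):

```pig
-- split $0 on semicolons, written as the Unicode escape \u003B
m = FOREACH H1 GENERATE STRSPLIT($0, '\u003B', 50);
```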

Loading Raw JSON into Pig

Posted by 最后都变了- on 2019-12-09 13:21:08

问题 I have a file where each line is a JSON object (actually, it's a dump of Stack Overflow). I would like to load this into Apache Pig as easily as possible, but I am having trouble figuring out how to tell Pig what the input format is. Here's an example of an entry:

{ "_id" : { "$oid" : "506492073401d91fa7fdffbe" }, "Body" : "....", "ViewCount" : 7351, "LastEditorDisplayName" : "Rich B", "Title" : ".....", "LastEditorUserId" : 140328, "LastActivityDate" : { "$date" : 1314819738077 },
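One common approach for one-JSON-object-per-line input (not taken from the truncated entry above, so treat it as a sketch) is Twitter's elephant-bird JsonLoader, which loads each record as a Pig map; the jar name and input path here are assumptions:

```pig
REGISTER 'elephant-bird-pig.jar';  -- assumption: exact jar names vary by version

-- -nestedLoad parses nested JSON objects into nested maps
raw = LOAD 'stackoverflow.json'
      USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- each record is a map; pull fields out by key and cast as needed
titles = FOREACH raw GENERATE (chararray)$0#'Title' AS title,
                              (long)$0#'ViewCount' AS views;
```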

How do I get schema / column names from parquet file?

Posted by 余生长醉 on 2019-12-09 07:50:07

问题 I have a file stored in HDFS as part-m-00000.gz.parquet. I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet, but it's compressed, so I ran gunzip part-m-00000.gz.parquet. That doesn't uncompress the file, since gunzip doesn't recognise the .parquet extension. How do I get the schema / column names for this file?

回答1: You won't be able to "open" the file using hdfs dfs -text because it's not a text file. Parquet files are written to disk very differently compared to text files. And for the
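From within Pig itself, one way to see the schema (a sketch, assuming the parquet-pig bundle jar is available and registered) is to load the file with ParquetLoader and DESCRIBE the resulting relation:

```pig
REGISTER 'parquet-pig-bundle.jar';  -- assumption: exact jar name depends on your Parquet version

A = LOAD 'dir/part-m-00000.gz.parquet'
    USING org.apache.parquet.pig.ParquetLoader();

DESCRIBE A;  -- prints the schema read from the Parquet file footer
```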

Using Regex in Pig in Hadoop

Posted by ∥☆過路亽.° on 2019-12-09 03:54:26

问题 I have a CSV file containing user tweets (tweetid, tweet, userid):

396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“@BleacherReport: Halloween has given us this amazing Derrick Rose photo (via @amandakaschube, @ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03

Now I need to write a Pig query that returns
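Because the quoted tweet text can itself contain commas, a plain PigStorage(',') load would split these lines incorrectly. One sketch (the required output is an assumption, since the question is truncated) is to load each line whole and pull the three fields apart with REGEX_EXTRACT_ALL:

```pig
raw = LOAD 'tweets.csv' USING TextLoader() AS (line:chararray);

-- the pattern must match the whole line: digits, a quoted tweet, then the userid
parsed = FOREACH raw GENERATE FLATTEN(
             REGEX_EXTRACT_ALL(line, '^(\\d+),"(.*)",([^,]+)$')
         ) AS (tweetid:chararray, tweet:chararray, userid:chararray);
```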

How to prevent Apache pig from outputting empty files?

Posted by 被刻印的时光 ゝ on 2019-12-08 19:16:30

I have a Pig script that reads data from a directory on HDFS. The data are stored as Avro files. The directory structure looks like:

DIR
  --Subdir1
  --Subdir2
  --Subdir3
  --Subdir4

In the Pig script I am simply doing a load, filter and store. It looks like:

items = LOAD path USING AvroStorage();
items = FILTER items BY some property;
STORE items INTO outputDirectory USING AvroStorage();

The problem right now is that Pig is outputting many empty files in the output directory. I am wondering if there's a way to remove those files? Thanks!

For Pig version 0.13 and later, you can set pig.output.lazy=true to
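The truncated answer points at the lazy-output flag; as a sketch, it can be set at the top of the script (available in Pig 0.13 and later):

```pig
-- only create an output part file once there is at least one record to write
SET pig.output.lazy true;
```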

Projecting Grouped Tuples in Pig

Posted by 旧街凉风 on 2019-12-08 19:08:46

问题 I have a collection of tuples of the form (t,a,b) that I want to group by b in Pig. Once grouped, I want to filter b out of the tuples in each group and generate a bag of filtered tuples per group. As an example, assume we have

(1,2,1)
(2,0,1)
(3,4,2)
(4,1,2)
(5,2,3)

The Pig script should produce

{(1,2),(2,0)}
{(3,4),(4,1)}
{(5,2)}

The question is: how do I go about producing this result? I'm used to seeing examples where aggregation operations follow a GROUP BY operation. It's less clear to
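A sketch of one way to get this shape, using Pig's bag projection to drop b inside each group (the field names are assumptions taken from the (t,a,b) description):

```pig
data = LOAD 'input' USING PigStorage(',') AS (t:int, a:int, b:int);

grouped = GROUP data BY b;

-- project only (t, a) out of each group's bag, discarding the grouping key b
result = FOREACH grouped GENERATE data.(t, a);
```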

Java Pig Latin sentence translator using Queues

Posted by 坚强是说给别人听的谎言 on 2019-12-08 14:24:24

I am very new to Java and am trying to create a program that translates a sentence into Pig Latin: move the first letter of each word to the end, then append "y" if that first letter was a vowel and "ay" otherwise. I am required to use a queue for this. Currently my program just terminates, and I was wondering if anyone might be able to spot where I am going wrong, or where to head next. Thanks!

import MyQueue.QueueList;
import java.util.Scanner;

public class PigLatin {
    public static void main (String[] args) {
        Scanner scan = new Scanner (System.in);
        QueueList word = new

Pig programming: using SPLIT on GROUP BY with COUNT(*)

Posted by 我与影子孤独终老i on 2019-12-08 13:46:40

问题 The input file is:

2, cornflakes, Regular, General Mills, 12
3, cornflakes, Mixed Nuts, Post, 14
4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

filter3 = LOAD 'location_of_file' USING PigStorage('\t') AS (item_sl:int, item:chararray, type:chararray, manufacturer:chararray, price:int);
SPLIT filter3 INTO filter4 IF (FOREACH (filter3 GROUP BY item)
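A SPLIT condition cannot contain a nested FOREACH or GROUP. One common pattern (a sketch; the count threshold and relation names are assumptions) is to compute the per-item counts first, join them back onto the rows, and then SPLIT on the precomputed count:

```pig
-- count how many rows each item has
counts = FOREACH (GROUP filter3 BY item)
         GENERATE group AS item, COUNT(filter3) AS cnt;

-- attach the count to every original row
joined = JOIN filter3 BY item, counts BY item;

-- now SPLIT on a plain field instead of an inline aggregate
SPLIT joined INTO frequent IF cnt >= 2L, rare IF cnt < 2L;
```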

How to have Pig store rows in HBase as strings not bytes?

Posted by ⅰ亾dé卋堺 on 2019-12-08 13:41:40

问题 If I use the HBase shell and issue:

put 'test', 'rowkey1', 'cf:foo', 'bar'
scan 'test'

I will see the result as a string, not as bytes. If I use happybase and issue:

import happybase
connection = happybase.Connection('<hostname>')
table = connection.table('test')
table.put('rowkey2', {'cf:foo': 'bar'})
for row in table.scan():
    print row

I will see the result as a string, not as bytes. I have data in Hive that I ran an aggregation on and stored on HDFS via:

INSERT OVERWRITE DIRECTORY
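When Pig loads Hive-written output without declared types, the fields come through as bytearray and HBaseStorage writes their raw binary form. A sketch of declaring chararray types so the values land in HBase as UTF-8 strings (the path, table name, and delimiter here are assumptions):

```pig
-- Hive's default field delimiter is \u0001 (Ctrl-A)
data = LOAD '/path/to/hive/output' USING PigStorage('\u0001')
       AS (rowkey:chararray, foo:chararray);

-- chararray fields are stored as UTF-8 strings, matching the shell's behaviour
STORE data INTO 'hbase://test'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:foo');
```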

How can I do this inner join properly in Apache PIG?

Posted by 大憨熊 on 2019-12-08 12:22:55

问题 I have two files. One, called a-records:

123^record1
222^record2
333^record3

and the other, called b-records:

123^jim
123^jim
222^mike
333^joe

You can see that file A has the token 123 once, while file B has it twice. Is there a way, using Apache Pig, to join the data such that I only get ONE joined record per row of the A file? Here is my current script, which outputs the following below:

arecords = LOAD '$a' USING PigStorage('^') as (token:chararray, type:chararray);
brecords = LOAD
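One sketch that yields a single joined row per A record is to de-duplicate the B side with DISTINCT before joining (the b-records schema is an assumption, since the script above is truncated):

```pig
brecords = LOAD '$b' USING PigStorage('^') AS (token:chararray, name:chararray);

b_unique = DISTINCT brecords;   -- collapses the duplicate 123^jim rows

joined = JOIN arecords BY token, b_unique BY token;
```

Note that DISTINCT only removes rows that are duplicates in every field; if B can hold different values for the same token, group by token and pick one value per group instead.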