apache-pig

Escape special characters in Apache Pig data

二次信任 submitted on 2019-12-24 14:17:49
Question: I am using Apache Pig to process some data. My data set has some strings that contain special characters, i.e. (#, {}, []). The Programming Pig book says that you can't escape those characters. So how can I process my data without deleting the special characters? I thought about replacing them but would like to avoid that. Thanks. Answer 1: Have you tried loading your data? There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem
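A minimal sketch of the point the answer starts to make (file name, delimiter, and field names are assumptions): PigStorage only splits on the load delimiter, so characters like #, {, }, [ and ] inside a field need no escaping as long as they are not the delimiter itself:

```pig
-- input.txt (tab-delimited); the second field contains special characters:
-- 1	foo#{bar}[baz]
-- 2	plain
raw = LOAD 'input.txt' USING PigStorage('\t') AS (id:int, txt:chararray);
-- the special characters survive as ordinary chararray content
DUMP raw;
```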

Encountered IOException while registering Python UDF in Pig: File helloworld.py does not exist

拜拜、爱过 submitted on 2019-12-24 13:52:10
Question: Python UDF: @outputSchema("word:chararray") def helloworld(): return 'Hello, World' register '/user/hdfs/helloworld.py' using jython as myfunc; Error: grunt> REGISTER 'helloworld.py' USING org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs; 2016-05-16 12:08:04,909 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. File helloworld.py does not exist 2016-05-16 12:08:04,909 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write
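Note that the REGISTER in the error uses a relative path ('helloworld.py'), which Pig resolves against the local working directory rather than HDFS. A sketch of the usual fix (the exact paths are assumptions):

```pig
-- Register with an explicit path. For a file on the local filesystem:
REGISTER '/home/user/helloworld.py' USING jython AS myfuncs;
-- or, if the script lives on HDFS, say so explicitly:
-- REGISTER 'hdfs:///user/hdfs/helloworld.py' USING jython AS myfuncs;

lines = LOAD 'input.txt' AS (line:chararray);
out   = FOREACH lines GENERATE myfuncs.helloworld();
```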

Python UDFs in Pig

一世执手 submitted on 2019-12-24 13:43:02
Question: I've seen the documentation here, but I confess that I find it rather lacking. I was wondering if anyone could give me a collection of examples of incorporating Python UDFs into Pig. In particular: prior to Pig 0.10, the boolean type did not exist, but a FILTER operation requires the result to resolve to a boolean. Am I forever cursed with returning 1 or 0 and using FILTER alias BY py_udf.f(field) > 0 if I don't have the latest version? Are the Algebraic, Accumulator, and Filter interfaces
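A sketch of the 1/0 workaround the question describes (the file name, predicate, and schema are assumptions): on pre-0.10 Pig the Jython UDF returns an int, and FILTER compares it against 0:

```pig
-- py_udf.py (Jython):
--   @outputSchema("keep:int")
--   def f(field):
--       return 1 if field is not None and len(field) > 3 else 0

REGISTER 'py_udf.py' USING jython AS py_udf;
data     = LOAD 'input.txt' AS (field:chararray);
filtered = FILTER data BY py_udf.f(field) > 0;
```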

Pig pass relation as argument to UDF

这一生的挚爱 submitted on 2019-12-24 13:26:33
Question: I need to pass a relation to a UDF in Pig: articles = load x using ...; groupedArticles = udfs.MyUDF(articles); Is something like this possible? Any workaround? Thanks. Answer 1: I guess you mean to pass all fields of the relation to the UDF? Passing the relation itself would not make sense. In any case this depends on what your load statement looks like. If you load each entry as a tuple, load x using ... as (entry:(a:int, b:chararray, ...)), then you could pass that to the UDF like groupedArticles = foreach
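A common workaround for this pattern (the UDF name and schema are assumptions): group the whole relation into a single bag, then pass that bag to the UDF, which gives it access to every row at once:

```pig
articles = LOAD 'x' AS (a:int, b:chararray);
all_rows = GROUP articles ALL;                 -- one tuple: (all, {bag of all rows})
result   = FOREACH all_rows GENERATE udfs.MyUDF(articles);
```

Note this funnels the entire relation through a single reducer, so it only scales as far as one bag fits in memory.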

How do I ignore brackets when loading an external table in Hive

怎甘沉沦 submitted on 2019-12-24 13:09:26
Question: I'm trying to load the output of a Pig script as an external table in Hive. Pig enclosed each row in parentheses () (tuples?) like this: (1,2,3,a) (2,4,5,b) (4,2,6,c) and I can't find a way to tell Hive to ignore those brackets, which results in null values for the first column as it is actually an integer. Any thoughts on how to proceed? I know I can use a FLATTEN command in Pig but I would also like to learn how to deal with these files directly from Hive. Answer 1: There is no way to do this
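On the Pig side, the parentheses appear because a tuple, rather than its fields, was stored. A sketch of the FLATTEN fix the question alludes to (field names are assumptions):

```pig
-- If each row is a nested tuple like (1,2,3,a), STORE writes the parentheses.
-- Flattening first writes plain comma-delimited fields Hive can read directly:
data = LOAD 'input' AS (t:(c1:int, c2:int, c3:int, c4:chararray));
flat = FOREACH data GENERATE FLATTEN(t);
STORE flat INTO 'for_hive' USING PigStorage(',');   -- rows become 1,2,3,a
```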

Pig - retrieve data from XML using XPath

假装没事ソ submitted on 2019-12-24 11:34:53
Question: I have n XML files of this type. <students roll_no="1"> <name>abc</name> <gender>m</gender> <maxmarks> <marks> <year>2014</year> <maths>100</maths> <english>100</english> <spanish>100</spanish> </marks> <marks> <year>2015</year> <maths>110</maths> <english>110</english> <spanish>110</spanish> </marks> </maxmarks> <marksobt> <marks> <year>2014</year> <maths>90</maths> <english>95</english> <spanish>82</spanish> </marks> <marks> <year>2015</year> <maths>94</maths> <english>98</english>
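One way to approach files like this is piggybank's XMLLoader combined with its XPath EvalFunc (available in piggybank from Pig 0.13; the jar path and expressions below are assumptions):

```pig
REGISTER '/path/to/piggybank.jar';
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

-- XMLLoader emits one chararray per <students>...</students> element
docs  = LOAD 'students.xml'
        USING org.apache.pig.piggybank.storage.XMLLoader('students')
        AS (doc:chararray);
names = FOREACH docs GENERATE XPath(doc, 'students/name'),
                              XPath(doc, 'students/gender');
```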

RANK inside the bag?

随声附和 submitted on 2019-12-24 09:19:56
Question: Let's say I have set_of_values: a,k a,l a,m b,x b,y b,z. If I use a = RANK set_of_values; I get: 1,a,k 2,a,l 3,a,m 4,b,x 5,b,y 6,b,z. What I would like to achieve is RANK, but inside the group. First: a = group set_of_values by first_value; (a,{(a,k),(a,l),(a,m)}) (b,{(b,x),(b,y),(b,z)}) What should I do now to get: (a,{(1,a,k),(2,a,l),(3,a,m)}) (b,{(1,b,x),(2,b,y),(3,b,z)})? EDIT (added RANK inside foreach): b = foreach a { c = RANK $1; generate c; } I get: 2014-03-05 09
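RANK is not supported inside a nested FOREACH block, which is why the EDIT fails. A common workaround (the jar path is an assumption) is DataFu's Enumerate UDF, which appends an index to each tuple of a bag, effectively a per-group RANK:

```pig
REGISTER '/path/to/datafu.jar';
DEFINE Enumerate datafu.pig.bags.Enumerate('1');   -- start numbering at 1

a = GROUP set_of_values BY first_value;
-- Enumerate appends the index as the LAST field of each tuple in the bag,
-- e.g. (a,{(a,k,1),(a,l,2),(a,m,3)}), rather than prepending it
b = FOREACH a GENERATE group, Enumerate(set_of_values);
```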

Pig REGEX_EXTRACT_ALL function -> no results

北城余情 submitted on 2019-12-24 08:16:30
Question: I have been running into an issue for several hours already. I have a .csv file with JSON strings inside. Every column in that .csv contains a string with several JSON objects. I imported several columns with PigStorage, which worked so far. Then I tried to extract the JSON objects, which have the following form: [{"tmestmp":"2014-05-14T07:01:00","Value":0,"Quality":1},{"tmestmp":"2014-05-14T07:01:00.02","Value":10,"Quality":4},{"tmestmp":"2014-05-14T07:01:00.04","Value":17,"Quality":9},{"tmestmp":
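A frequent cause of empty REGEX_EXTRACT_ALL results is that the pattern must match the entire input string, not just a substring. A sketch (the column layout and regex are assumptions); each (...) capture group becomes one field of the result tuple:

```pig
rows  = LOAD 'data.csv' USING PigStorage(';') AS (json:chararray);
-- anchor the pattern with leading/trailing .* so it matches the whole string
parts = FOREACH rows GENERATE
        REGEX_EXTRACT_ALL(json, '.*"Value":([0-9]+),"Quality":([0-9]+).*');
```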

Cassandra/Hadoop/Pig design for loading and processing data

倾然丶 夕夏残阳落幕 submitted on 2019-12-24 06:52:15
Question: I have a setup of Hadoop, Cassandra, Pig, and MySQL. My goal is to read one month of data from Cassandra, process it, and put the result into MySQL periodically. What is the best practice? Do I need to load all the data and filter the month in Pig, or filter while loading from Cassandra using Pig/CQL (using CqlStorage)? The problem here is that if I need to filter while loading from Cassandra, Pig has a bug with WHERE clauses on CQL (https://issues.apache.org/jira/browse/CASSANDRA-6151), or a problem with
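For reference, the two options look roughly like this (keyspace, table, and column names are assumptions; the where_clause URL parameter must be URL-encoded and is the part affected by the CASSANDRA-6151 bug linked above):

```pig
-- Option 1: push the filter into Cassandra via CqlStorage's where_clause
rows = LOAD 'cql://my_keyspace/events?where_clause=event_date%20%3E%3D%20%272014-01-01%27'
       USING org.apache.cassandra.hadoop.pig.CqlStorage();

-- Option 2 (fallback while the bug applies): load everything, filter in Pig
-- rows     = LOAD 'cql://my_keyspace/events' USING org.apache.cassandra.hadoop.pig.CqlStorage();
-- filtered = FILTER rows BY event_date >= '2014-01-01';
```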

Cutting down bag to pass to udf

China☆狼群 submitted on 2019-12-24 04:23:09
Question: Using Pig on a Hadoop cluster, I have a huge bag of huge tuples to which I regularly add fields as I continue to work on this project, and several UDFs that use various fields from it. I want to be able to call a UDF on just a few fields from each tuple and reconnect the result to that particular tuple. Doing a join to reconnect the records using unique ids takes forever on billions of records. I think there should be a way to do this all inside the GENERATE statement, but I can't find the
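A sketch of the GENERATE-only approach (the UDF name and schema are assumptions): compute the UDF on a few fields inside the same FOREACH that re-emits the whole tuple, so no join is needed to reattach the result:

```pig
big    = LOAD 'data' AS (id:long, a:int, b:chararray, c:double);
-- '*' re-emits every existing field; the UDF result is appended as a new column
result = FOREACH big GENERATE *, myudfs.Score(a, c) AS score;
```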