apache-pig

Reading a file in javascript via Apache Pig UDF

删除回忆录丶 submitted on 2019-12-25 11:56:36
Question: I have some (very simplified) Node.js code here:

    var fs = require('fs');
    var derpfile = String(fs.readFileSync('./derp.txt', 'utf-8'));
    var derps = derpfile.split('\n');
    for (var i = 0; i < derps.length; ++i) {
        // do something with my derps here
    }

The problem is, I cannot use Node in Pig UDFs (that I am aware of; if I can do this, please let me know!). When I look at 'file io' in JavaScript, all the tutorials I see concern the browser sandbox. I need to read a file off the filesystem,
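If the goal is just to iterate over the lines of the file, a hedged alternative is to let Pig load it and pass each line to a JavaScript UDF; Pig can register Rhino (not Node) JavaScript UDFs natively. A minimal sketch, where the file name 'udfs.js' and the function process are assumed, not from the original question:

    -- register a JavaScript UDF file (executed by Rhino, not Node)
    REGISTER 'udfs.js' USING javascript AS myudfs;
    -- let Pig read the file: each tuple holds one line of derp.txt
    derps = LOAD 'derp.txt' USING PigStorage() AS (line:chararray);
    -- hand each line to the (hypothetical) UDF instead of doing file io inside it
    processed = FOREACH derps GENERATE myudfs.process(line);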

ElephantBird ERROR 1070 --> class not getting read

江枫思渺然 submitted on 2019-12-25 08:28:17
Question: My problem is similar to this unanswered question: https://stackoverflow.com/questions/42140344/elephantbird-dependency-jars. I have registered all the jars mandatory for ElephantBird to function:

    REGISTER '/MyJARS/elephant-bird-hadoop-compat-4.1.jar';
    REGISTER '/MyJARS/json-simple-1.1.jar';
    REGISTER '/MyJARS/elephant-bird-pig-4.1.jar';
    REGISTER '/MyJARS/elephant-bird-core-4.10.jar';
    REGISTER '/MyJARS/google-collections-1.0.jar';

The following links tell me this: 1: Loading data from HDFS does
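For reference, a hedged sketch of how these jars are typically used once registered (the input path is an assumption). Note also that the list above mixes 4.1 compat/pig jars with a 4.10 core jar; mismatched ElephantBird versions are a common source of class-resolution failures such as ERROR 1070, so a matching 4.1 core jar is assumed below:

    REGISTER '/MyJARS/elephant-bird-hadoop-compat-4.1.jar';
    REGISTER '/MyJARS/elephant-bird-core-4.1.jar';
    REGISTER '/MyJARS/elephant-bird-pig-4.1.jar';
    REGISTER '/MyJARS/json-simple-1.1.jar';
    REGISTER '/MyJARS/google-collections-1.0.jar';
    -- '-nestedLoad' tells the loader to descend into nested JSON objects
    data = LOAD 'input.json'
           USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');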

Apache Pig - nested FOREACH over same relation

那年仲夏 submitted on 2019-12-25 08:21:17
Question: I have a number of bags and I want to compute the pairwise similarities between the bags.

    sequences = FOREACH raw GENERATE gen_bag(logs);

The relation is described as follows:

    sequences: {t: (type: chararray, value: chararray)}

The similarity is computed by a Python UDF that takes two bags as arguments. I have tried to do a nested FOREACH over the sequences relation, but I can't loop over the same relation twice. I've also tried to define the sequences twice, but I can't access the copy in the
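A hedged sketch of one common workaround: materialize a second alias over the same data and CROSS the two relations, then hand each pair to the UDF (the namespace myfuncs and the function similarity are assumed names, not from the question):

    -- a second alias, since a relation cannot be crossed with itself
    copy  = FOREACH sequences GENERATE *;
    -- every pairwise combination, including self-pairs and both orderings
    pairs = CROSS sequences, copy;
    sims  = FOREACH pairs GENERATE myfuncs.similarity($0, $1);

Self-pairs and duplicate orderings would still need filtering, for example by carrying an id through both aliases and adding a FILTER. CROSS is also expensive: on large inputs it effectively squares the data size.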

Trouble running pig in both local and mapreduce modes

家住魔仙堡 submitted on 2019-12-25 03:58:27
Question: I already have Hadoop 1.2 running on my Ubuntu VM, which runs on a Windows 7 machine. I recently installed Pig 0.12.0 on the same Ubuntu VM, downloaded as pig-0.12.0.tar.gz from the Apache website. I have all the variables such as JAVA_HOME, HADOOP_HOME and PIG_HOME set correctly. When I try to start Pig in local mode, this is what I see:

    chandeln@ubuntu:~$ pig -x local
    pig: invalid option -- 'x'
    usage: pig
    chandeln@ubuntu:~$ echo $JAVA_HOME
    /usr/lib/jvm/java7
    chandeln@ubuntu:

Apache Pig: Dynamic columns

强颜欢笑 submitted on 2019-12-25 03:19:12
Question: I have a dataset (CSV) with three value columns (v1, v2 and v3). The description of each value is stored as a comma-separated string in the column 'keys'.

    | v1 | v2 | v3 | keys  |
    | A  | C  | E  | X,Y,Z |

Using Pig I would like to load this information into an HBase table where the column family is C and the column qualifier is the key:

    | C:X | C:Y | C:Z |
    | A   | C   | E   |

Has anyone done this before and would like to share this knowledge? Another option is to store a map (key#value) in a
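A hedged sketch of the map-based route (assuming each row also carries a row key, which the sample data omits): the builtins STRSPLIT and TOMAP can pair the split keys with the values, and HBaseStorage accepts a wildcard column spec so that the map keys become column qualifiers:

    raw    = LOAD 'data.csv' USING PigStorage(',')
             AS (rk:chararray, v1:chararray, v2:chararray, v3:chararray, keys:chararray);
    -- STRSPLIT turns 'X,Y,Z' into a tuple; TOMAP pairs each key with its value
    keyed  = FOREACH raw GENERATE rk, STRSPLIT(keys, ',') AS k, v1, v2, v3;
    mapped = FOREACH keyed GENERATE rk, TOMAP(k.$0, v1, k.$1, v2, k.$2, v3) AS cols;
    -- 'C:*' writes every map key as a qualifier under column family C
    STORE mapped INTO 'hbase://mytable'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('C:*');

This sketch assumes exactly three keys per row; a variable number of keys would need a small custom UDF to build the map.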

hadoop pig joining on any matching tuple values

前提是你 submitted on 2019-12-25 02:38:11
Question: I'm new to Pig and trying to use it to process a dataset. I have a set of records that looks like:

    id  elements
    --------------
    1   ["a","b","c"]
    2   ["a","f","g"]
    3   ["f","g","h"]

The idea is that I want to create tuples of elements that have any overlapping elements. If elements were just a single item instead of an array, I could do a simple join like:

    A = LOAD 'mydata' ...
    B = FOREACH A GENERATE id AS id_2, elements AS elements_2;
    C = JOIN A BY elements, B BY elements_2;

But since elements is an
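A hedged sketch of the usual approach (schema and file names are assumptions): FLATTEN each elements bag so that a standard join on single values can find any overlap:

    A  = LOAD 'mydata' AS (id:int, elements:bag{t:(e:chararray)});
    A1 = FOREACH A GENERATE id, FLATTEN(elements) AS e;
    B1 = FOREACH A1 GENERATE id AS id_2, e AS e_2;
    -- rows join whenever two ids share at least one element
    C  = JOIN A1 BY e, B1 BY e_2;
    -- drop self-matches; DISTINCT the id pairs if one match per pair is enough
    D  = FILTER C BY id != id_2;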

Error from Json Loader in Pig

血红的双手。 submitted on 2019-12-25 02:05:56
Question: I got the error below while writing a JSON-loading script. Please let me know how to write a JSON loader script in Pig.

Script:

    x = LOAD 'hdfs://user/spanda20/pig/phone.dat'
        USING JsonLoader('id:chararray, phone:(home:{(num:chararray, city:chararray)})');

Data set:

    {
      "id": "12345",
      "phone": {
        "home": [
          { "zip": "23060", "city": "henrico" },
          { "zip": "08902", "city": "northbrunswick" }
        ]
      }
    }

Error:

    2015-03-18 14:24:10,917 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer
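A hedged guess at the fix: the schema handed to JsonLoader must name the fields exactly as they appear in the JSON ('zip', not 'num'), and the builtin JsonLoader expects one JSON record per line rather than pretty-printed multi-line records:

    x = LOAD 'hdfs://user/spanda20/pig/phone.dat'
        USING JsonLoader('id:chararray, phone:(home:{(zip:chararray, city:chararray)})');

For arbitrary nested JSON like the sample above, ElephantBird's com.twitter.elephantbird.pig.load.JsonLoader is often suggested instead.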

Pig Store the file with custom row/record delimiter

可紊 submitted on 2019-12-24 17:23:03
Question: I have a file with Ctrl-B as the record delimiter. I was able to read the file in Pig by overriding the LoaderInputFormat class and the getInputFormat() method in PigStorage, but I was not able to store the file with Ctrl-B as the record delimiter.

Answer 1: Read a Ctrl-B delimited record:

    SET textinputformat.record.delimiter '\n'
    x = LOAD 'xyz' USING PigStorage('\u0002');

Write a Ctrl-B delimited record:

    STORE x INTO 'y' USING PigStorage('\u0002');

Source: https://stackoverflow.com/questions/38776692/pig-store-the
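A hedged aside on the answer above: PigStorage('\u0002') treats Ctrl-B as the field delimiter within each newline-terminated record. If Ctrl-B must separate whole records instead, the Hadoop record delimiter itself can in principle be overridden; a sketch (whether the \u escape is interpreted here may depend on the Pig version):

    -- assumption: split records on Ctrl-B rather than newline
    SET textinputformat.record.delimiter '\u0002';
    x = LOAD 'xyz' USING PigStorage();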

Is it possible to detect and handle string collisions among grouped values when grouping in Hadoop Pig?

风格不统一 submitted on 2019-12-24 15:32:03
Question: Assume I have lines of data like the following that show user names and their favorite fruits:

    Alice\tApple
    Bob\tApple
    Charlie\tGuava
    Alice\tOrange

I'd like to create a Pig query that shows the favorite fruit of each user. If a user appears multiple times, I'd like to show "Multiple". For example, the result with the data above should be:

    Alice\tMultiple
    Bob\tApple
    Charlie\tGuava

In SQL, this could be done something like this (although it wouldn't necessarily perform very well): select
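Since the excerpt's SQL is cut off, here is a hedged Pig sketch of the same logic (the file name is an assumption): GROUP by user, count the distinct fruits in each group, and use a bincond to substitute 'Multiple':

    raw     = LOAD 'fruits.tsv' USING PigStorage('\t') AS (name:chararray, fruit:chararray);
    grouped = GROUP raw BY name;
    result  = FOREACH grouped {
                uniq = DISTINCT raw.fruit;
                -- MAX is just a way to pick the single fruit when there is only one
                GENERATE group AS name,
                         (COUNT(uniq) > 1L ? 'Multiple' : MAX(raw.fruit)) AS fruit;
              };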

How compatible is Hadoop 3.0.0 with older versions of Hive, Pig, Sqoop and Spark?

我的未来我决定 submitted on 2019-12-24 14:24:58
Question: We are currently using Hadoop 2.8.0 on a 10-node cluster and are planning to upgrade to the latest Hadoop 3.0.0. I want to know whether there will be any issues if we use Hadoop 3.0.0 with older versions of Spark and other components such as Hive, Pig and Sqoop.

Answer 1: The latest Hive version does not support Hadoop 3.0. It seems that Hive may be built on Spark or other compute engines in the future.

Source: https://stackoverflow.com/questions/47920005/how-is-hadoop-3-0-0-s-compatibility-with