apache-pig

file formats that can be read using PIG

Submitted by 大憨熊 on 2019-12-10 14:38:39
Question: What kinds of file formats can be read using Pig, and how can I store them in different formats? Say we have a CSV file and I want to store it as an XML file; how can this be done? Whenever we use the STORE command, Pig creates a directory and stores the file as part-m-00000. How can I change the name of the file and overwrite the directory? Answer 1: What kinds of file formats can be read using Pig, and how can I store them in different formats? There are a few built-in loading and storing methods, but they are limited: BinStorage
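Pig itself always writes STORE output as part-* files inside a fresh directory and will not overwrite an existing one; the common workaround is to merge or rename the parts after the job finishes. Below is a minimal sketch of that post-processing step, assuming a local-mode output directory (the function name and paths are illustrative, not part of any Pig API):

```python
import os
import shutil

def collect_pig_output(output_dir, dest_file):
    """Merge Pig's part-* files from a local output directory into a single
    named file, mimicking `hadoop fs -getmerge` for local-mode runs."""
    parts = sorted(f for f in os.listdir(output_dir) if f.startswith("part-"))
    with open(dest_file, "wb") as out:
        for name in parts:
            with open(os.path.join(output_dir, name), "rb") as src:
                shutil.copyfileobj(src, out)
    return dest_file
```

On a real cluster the same idea is usually done with `hadoop fs -getmerge <dir> <file>` rather than local file operations.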

Pig not loading data into HCatalog table - HortonWorks Sandbox [closed]

Submitted by ぐ巨炮叔叔 on 2019-12-10 12:03:37
Question: Closed. This question is off-topic and is not currently accepting answers. I am running a Pig script in the HortonWorks virtual machine with the goal of extracting certain parts of my XML dataset and loading those parts into columns in an HCatalog table. On my local machine, I run my Pig script on the XML file and get an output file with all the extracted parts. However, for some

Hadoop Pig UDF invocation issue

Submitted by 和自甴很熟 on 2019-12-10 11:58:28
Question: The following code works quite well, but when I already have two existing bags (with aliases, say S1 and S2, representing two sets), how do I call the UDF setDifference on them to generate the set difference? I think that if I manually construct an additional bag from my existing input bags (S1 and S2), it will add overhead. register datafu-1.2.0.jar; define setDifference datafu.pig.sets.SetDifference(); -- ({(3),(4),(1),(2),(7),(5),(6)} \t {(1),(3),(5
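The operation DataFu's SetDifference performs on two bags is an ordered set difference: every tuple of the first bag that does not appear in the second. A minimal Python sketch of that logic, with hypothetical sample data matching the dump in the question:

```python
def set_difference(bag1, bag2):
    """Return the tuples of bag1 that do not appear in bag2,
    mirroring the result datafu.pig.sets.SetDifference computes."""
    exclude = set(bag2)
    return [t for t in bag1 if t not in exclude]

# Hypothetical bags standing in for the S1 and S2 aliases
s1 = [(1,), (2,), (3,), (4,), (5,), (6,), (7,)]
s2 = [(1,), (3,), (5,)]
```

Note that the actual DataFu UDF additionally requires its input bags to be sorted, which this sketch does not enforce.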

Pig 3rd party UDF clarification

Submitted by 落爺英雄遲暮 on 2019-12-10 11:57:50
Question: I am new to Pig. From the Pig wiki page I learned that there is the piggybank UDF collection, as well as another useful collection, DataFu, from LinkedIn. I also learned that from Pig 0.8 onward, piggybank is part of Apache Pig's built-in UDFs. But I think most of the piggybank UDFs are not documented in Apache Pig, like StringConcat. I am looking for some date-formatting UDFs which will convert a datetime to a String, like FormatDate. I am not sure whether we already have these UDFs in pig/piggybank, as I could not find it in

Restricting loading of log files in Pig Latin based on interested date range as parameter input

Submitted by 筅森魡賤 on 2019-12-10 11:52:10
Question: I'm having a problem loading log files based on parameter input and was wondering whether someone could provide some guidance. The logs in question are Omniture logs, stored in subdirectories based on year, month, and day (e.g. /year=2013/month=02/day=14), with the date stamp in the filename. For any given day, multiple logs could exist, each hundreds of MB. I have a Pig script which currently processes logs for an entire month, with the month and the year specified as script
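One way to restrict LOAD to a date range with that partitioned layout is to expand the range into an explicit list of directory paths and pass it to the script as a parameter. A minimal Python sketch of that expansion, assuming the /year=/month=/day= layout from the question (the root path is hypothetical):

```python
from datetime import date, timedelta

def log_paths_for_range(start, end, root="/logs"):
    """Expand an inclusive date range into the partitioned directory paths,
    which could then be joined with commas into a Pig LOAD input string."""
    paths = []
    d = start
    while d <= end:
        paths.append(f"{root}/year={d.year}/month={d.month:02d}/day={d.day:02d}")
        d += timedelta(days=1)
    return paths
```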

Pig: Pivoting & Sum 3 relations

Submitted by ぃ、小莉子 on 2019-12-10 11:47:15
Question: I have 3 different relations, as shown below. I can get the output using UDFs, but I am looking for an implementation in Pig. I have looked through other material in the forums but could not get a concrete idea for this problem.
Proc:
FN1,10
FN2,20
FN3,23
FN4,25
FN5,15
FN7,40
FN10,56
Rej:
FN1,12
FN2,13
FN3,33
FN6,60
FN8,23
FN9,44
FN10,4
AllFN:
FN1 FN2 FN3 FN4 FN5 FN6 FN7 FN8 FN9 FN10
Output required:
FN1,10,12,22
FN2,20,13,33
FN3,23,33,56
FN4,25,0,25
FN5,15,0,15
FN6,0,60,60
FN7,40,0,40
FN8,0,23,23
FN9,0,44,44
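The required output amounts to a full outer join of Proc and Rej over the key list in AllFN, defaulting missing values to 0 and summing the two columns. A minimal Python sketch of that logic, using a subset of the data above:

```python
def pivot_sum(proc, rej, all_fns):
    """For each key in all_fns, look up its value in proc and rej
    (defaulting to 0) and emit (key, proc value, rej value, sum)."""
    p, r = dict(proc), dict(rej)
    return [(fn, p.get(fn, 0), r.get(fn, 0), p.get(fn, 0) + r.get(fn, 0))
            for fn in all_fns]
```

In Pig the same shape is typically achieved with two joins (or a COGROUP) on AllFN followed by a FOREACH that substitutes 0 for nulls before adding.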

avoiding prefixes in multi relation join in pig

Submitted by 六眼飞鱼酱① on 2019-12-10 11:42:14
Question: I am trying to do a star-schema style of join in Pig; my code is below. When I join multiple relations with different columns, I have to prefix each column with the name of the previous join every time to get it working. I am sure there must be a better way, but I have not been able to find it through googling. Any pointers would be very helpful. That is, prefixing a column like "H864::H86::hs_8_d::hs_8_desc" is what I want to avoid. hs_8 = LOAD 'hs_8_distinct' USING PigStorage('^') as (hs_8:chararray,hs_8_desc

Execute Pig from within Java Application

Submitted by こ雲淡風輕ζ on 2019-12-10 10:47:58
Question: Is it possible to run Apache Pig jobs from within a Java application, without forking an external process? It seems both Pig and Hadoop are written in Java but don't really offer Java APIs. Rather than relying on shell scripts, I'd rather use these tools from within a Java Spring application. Answer 1: See the Spring Hadoop project and its Pig support. Answer 2: It seems there is a Java API for Pig. According to this API, there is a PigRunner class. With that, you could easily add it to your Spring

generating bigram combinations from grouped data in pig

Submitted by 冷暖自知 on 2019-12-10 10:23:03
Question: Given my input data in userid,itemid format:
raw: {userid: bytearray,itemid: bytearray}
dump raw;
(A,1)
(A,2)
(A,4)
(A,5)
(B,2)
(B,3)
(B,5)
(C,1)
(C,5)
grpd = GROUP raw BY userid;
dump grpd;
(A,{(A,1),(A,2),(A,4),(A,5)})
(B,{(B,2),(B,3),(B,5)})
(C,{(C,1),(C,5)})
I'd like to generate all of the combinations (order not important) of items within each group. I eventually intend to compute Jaccard similarity on the items in each group. Ideally the bigrams would be generated and then I'd
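The expansion being asked for is exactly "all unordered pairs within each group". A minimal Python sketch of that logic, using the grouped data from the dump above (in Pig this is usually done with a self-join on userid or a nested FOREACH over the grouped bag):

```python
from itertools import combinations

def bigrams_per_group(grouped):
    """For each (key, items) group, emit every unordered pair of items,
    the pairs over which a Jaccard similarity could later be computed."""
    return {key: list(combinations(sorted(items), 2))
            for key, items in grouped}

# Groups matching the `dump grpd` output in the question
grouped = [("A", [1, 2, 4, 5]), ("B", [2, 3, 5]), ("C", [1, 5])]
```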

Passing a filename to Java UDF from Pig using distributed cache

Submitted by 喜欢而已 on 2019-12-10 09:59:12
Question: I am using a small map file in my Java UDF and I want to pass the filename of this file from Pig through the constructor. The following is the relevant part of my UDF:

public GenerateXML() throws IOException {
    this(null);
}

public GenerateXML(String mapFilename) throws IOException {
    if (mapFilename != null) {
        // do processing
    }
}

In the Pig script I have the following line:

DEFINE GenerateXML com.domain.GenerateXML('typemap.tsv');

This works in local mode, but