Pig Script: Join with multiple files

痞子三分冷 提交于 2019-12-01 09:47:20

问题


I am reading a big file (more than a billion records) and joining it with three other files, I was wondering if there is anyway the process can be made more efficient to avoid multiple reads on the big table.The smalltables may not fit in memory.

A = join smalltable1 by  (f1,f2) RIGHT OUTER,massive by (f1,f2) ;
B = join smalltable2 by  (f3) RIGHT OUTER, A by (f3) ;
C = join smalltable3 by  (f4) ,B by (f4) ;

The alternative that I was thinking is to write a udf and replace values in one read, but I am not sure if a udf would be efficient since the small files won't fit in the memory. The implementation could be like:

A = LOAD massive 
B = generate f1,udfToTranslateF1(f1),f2,udfToTranslateF2(f2),f3,udfToTranslateF3(f3)

Appreciate your thoughts...


回答1:


Pig 0.10 introduced integration with Bloom Filters http://search-hadoop.com/c/Pig:/src/org/apache/pig/builtin/Bloom.java%7C%7C+%2522done+%2522exec+Tuple%2522

You can train a bloom filter on the 3 smaller files and filter big file, hopefully it will result in a smaller file. After that perform standard joins to get 100% precision.

UPDATE 1 You would actually need to train 2 Bloom Filters, one for each of the small tables, as you join on different keys.

UPDATE 2 It was mentioned in the comments that the outer join is used for augmenting data. In this case Bloom Filters might not be the best thing, they are good for filtering and not adding data in outer joins, as you want to keep the non matched data. A better approach would be to partition all small tables on respective fields (f1, f2, f3, f4), store each partition into a separate file small enough to load into memory. Than Group BY massive table on f1, f2, f3, f4 and in a FOREACH pass the group (f1, f2, f3, f4) with associated bag to the custom function written in Java, that loads the respective partitions of the small files into RAM and performs augmentation.



来源:https://stackoverflow.com/questions/12390252/pig-script-join-with-multiple-files

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!