Avoiding prefixes in a multi-relation join in Pig

Submitted by 孤街浪徒 on 2019-12-08 11:32:28

By wrapping each join in an extra FOREACH you can simplify the aliases somewhat. Check the job statistics: this does not add extra MapReduce jobs to the pipeline. Both the original script and this version yield 4 map-only jobs.

For example:

H86 = foreach (JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated') generate 
        hs_8_d::hs_2 as x1, 
        hs_8_d::hs_4 as x2, 
        hs_8_d::hs_6 as x3, 
        hs_8_d::hs_8 as x4,
        hs_8_d::hs_8_desc as x5, 
        hs_6_d::hs_6 as x6,
        hs_6_d::hs_6_desc as x7;

H864 = foreach (JOIN H86 BY x2, hs_4_d BY hs_4 USING 'replicated') generate 
          H86::x1 as y1,
          H86::x2 as y2, 
          H86::x3 as y3,
          H86::x4 as y4,
          H86::x5 as y5, 
          H86::x6 as y6, 
          H86::x7 as y7,
          hs_4_d::hs_4 as y8,
          hs_4_d::hs_4_desc as y9;

H8642 = foreach (JOIN H864 BY y1, hs_2_d BY hs_2 USING 'replicated') generate 
          H864::y1 as z1, 
          H864::y2 as z2,
          H864::y3 as z3, 
          H864::y4 as z4, 
          H864::y5 as z5, 
          H864::y6 as z6, 
          H864::y7 as z7,
          H864::y8 as z8, 
          H864::y9 as z9, 
          hs_2_d::hs_2 as z10, 
          hs_2_d::hs_2_desc as z11;

hs_dim = FOREACH H8642 GENERATE z10, z11, z8, z9, z6, z7, z4, z5;

If you have a bag of tuples, then DataFu's AliasBagFields may be helpful.

Pig always prefixes fields with bagname:: to disambiguate them after a join. Unfortunately, I don't think you can avoid this.
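
Although the prefix always ends up in the joined schema, Pig only requires it when a field name is ambiguous. A field whose name is unique across the joined inputs can be referenced without `::`. Below is a minimal sketch of that, reusing the relation names from the answer above. The LOAD paths and the exact schemas are assumptions made for illustration, not something stated in the original.

```pig
-- Hypothetical input paths and schemas, for illustration only.
hs_8_d = LOAD 'hs_8_d.tsv' AS (hs_2:chararray, hs_4:chararray, hs_6:chararray,
                               hs_8:chararray, hs_8_desc:chararray);
hs_6_d = LOAD 'hs_6_d.tsv' AS (hs_6:chararray, hs_6_desc:chararray);

J = JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated';

-- hs_6 exists in both inputs, so it must be disambiguated with a prefix.
-- hs_8, hs_8_desc and hs_6_desc are unique within J, so the short names work.
-- The explicit AS gives each output field a plain, unprefixed name.
H86 = FOREACH J GENERATE
        hs_8_d::hs_6 AS hs_6,
        hs_8         AS hs_8,
        hs_8_desc    AS hs_8_desc,
        hs_6_desc    AS hs_6_desc;

DESCRIBE H86;
```

Running DESCRIBE after each FOREACH is a quick way to confirm which names the next join will see.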
