embedded hadoop-pig: what's the correct way to use the automatic addContainingJar for UDFs?

Asked by 空扰寡人 on 2020-01-16 19:18:29

Question


When you use pigServer.registerFunction, you're not supposed to call pigServer.registerJar explicitly; instead, Pig is meant to detect the containing jar automatically using JarManager.findContainingJar.

However, we have a complex UDF whose class depends on classes from multiple other jars. So we built a jar-with-dependencies using the maven-assembly plugin. But this causes the entire jar to land in pigContext.skipJars (since it contains pig.jar itself), so it never gets sent to the Hadoop cluster :(

What's the correct approach here? Must we manually call registerJar for every jar we depend on?
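For reference, the explicit-registration fallback mentioned above would look roughly like this (a sketch assuming PigServer's registerJar/registerFunction API; the jar paths and class names are hypothetical placeholders, not real artifacts):

```java
import org.apache.pig.FuncSpec;
import org.apache.pig.PigServer;

public class RegisterUdfSketch {
    public static void main(String[] args) throws Exception {
        PigServer pigServer = new PigServer("mapreduce");

        // Explicitly register the UDF jar and each jar it depends on,
        // so all of them are shipped to the cluster. Paths are hypothetical.
        pigServer.registerJar("/path/to/my-udf.jar");
        pigServer.registerJar("/path/to/dependency-a.jar");
        pigServer.registerJar("/path/to/dependency-b.jar");

        // Register the UDF under an alias; the class name is a placeholder.
        pigServer.registerFunction("myUdf", new FuncSpec("com.example.MyUdf"));
    }
}
```

This avoids relying on findContainingJar, at the cost of having to keep the list of dependency jars in sync by hand.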


Answer 1:


Not sure what the certified way is, but here are some pointers:

  • when you use pigServer.registerFunction, Pig automatically detects the jar that contains the UDF and ships it to the JobTracker
  • Pig also automatically detects the jar that contains the PigMapReduce class (in JarManager.createJar), extracts from it only the classes under org/apache/pig, org/antlr/runtime, etc., and ships those to the JobTracker as well
  • so, if your UDF sits in the same jar as PigMapReduce, you're screwed, because it won't get sent
  • our conclusion: don't use jar-with-dependencies
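The auto-detection in the first bullet boils down to asking the classloader which jar a class was loaded from. Here is a minimal stdlib-only sketch of that idea (the method name mirrors Pig's JarManager.findContainingJar, but this is a simplified assumption, not Pig's actual code):

```java
import java.net.URL;

public class ContainingJar {
    // Resource path for a class, e.g. java.util.List -> "java/util/List.class"
    static String classResource(Class<?> clazz) {
        return clazz.getName().replace('.', '/') + ".class";
    }

    // Returns the jar path the class was loaded from, or null if it
    // came from plain class files or the JDK runtime image.
    static String findContainingJar(Class<?> clazz) {
        ClassLoader loader = clazz.getClassLoader() != null
                ? clazz.getClassLoader()
                : ClassLoader.getSystemClassLoader();
        URL url = loader.getResource(classResource(clazz));
        if (url != null && "jar".equals(url.getProtocol())) {
            // URL looks like "jar:file:/path/to/foo.jar!/pkg/Cls.class"
            String path = url.getPath();
            return path.substring(0, path.lastIndexOf('!'));
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(classResource(java.util.List.class));
        System.out.println(findContainingJar(ContainingJar.class));
    }
}
```

This is also why the jar-with-dependencies fails: the lookup finds one jar for the UDF class, and if that same jar is on the skip list (because it contains pig.jar's classes), nothing gets shipped.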

HTH



Source: https://stackoverflow.com/questions/8636222/embedded-hadoop-pig-whats-the-correct-way-to-use-the-automatic-addcontainingja
