Question
When you use pigServer.registerFunction, you're not supposed to call pigServer.registerJar explicitly; instead, Pig automatically detects the containing jar using JarManager.findContainingJar.
However, we have a complex UDF whose class depends on classes from multiple other jars. So we built a jar-with-dependencies with the maven-assembly plugin. But that causes the entire jar to end up in pigContext.skipJars (because it contains pig.jar itself), so it never gets shipped to the Hadoop cluster :(
What's the correct approach here? Must we manually call registerJar for every jar we depend on?
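For concreteness, here is a minimal sketch of the manual-registration approach we're asking about, using the PigServer Java API; the jar paths and the com.example.pig.MyUdf class name are placeholders, not our real code:

```java
import org.apache.pig.ExecType;
import org.apache.pig.FuncSpec;
import org.apache.pig.PigServer;

public class RegisterUdfExample {
    public static void main(String[] args) throws Exception {
        PigServer pigServer = new PigServer(ExecType.MAPREDUCE);

        // Keep the UDF in its own thin jar and register each dependency jar
        // explicitly instead of shipping one fat jar-with-dependencies.
        pigServer.registerJar("/path/to/my-udf.jar");
        pigServer.registerJar("/path/to/dependency-a.jar");
        pigServer.registerJar("/path/to/dependency-b.jar");

        // Give the UDF a short alias for use in the Pig Latin below.
        pigServer.registerFunction("MY_UDF", new FuncSpec("com.example.pig.MyUdf"));

        pigServer.registerQuery("data = LOAD 'input' AS (line:chararray);");
        pigServer.registerQuery("result = FOREACH data GENERATE MY_UDF(line);");
        pigServer.store("result", "output");
    }
}
```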
Answer 1:
Not sure what the certified way is, but here are some pointers:
- When you use pigServer.registerFunction, Pig automatically detects the jar that contains the UDF and ships it to the jobTracker.
- Pig also automatically detects the jar that contains the PigMapReduce class (JarManager.createJar), extracts from it only the classes under org/apache/pig, org/antlr/runtime, etc., and ships those to the jobTracker as well.
- So if your UDF sits in the same jar as PigMapReduce, you're screwed, because it won't get sent.
- Our conclusion: don't use jar-with-dependencies. A quick way to check which jar each class would be picked up from is sketched below.
HTH
Source: https://stackoverflow.com/questions/8636222/embedded-hadoop-pig-whats-the-correct-way-to-use-the-automatic-addcontainingja