Question
I've seen the documentation here, but I confess that I find it rather lacking. I was wondering if anyone could give me a collection of examples of incorporating Python UDFs into Pig. In particular:
- Prior to Pig 0.10, the boolean type does not exist, but a FILTER operation requires the result resolve to a boolean. Am I forever cursed with returning 1 or 0 and using FILTER alias BY py_udf.f(field) > 0 if I don't have the latest version?
- Are the Algebraic, Accumulator, and Filter interfaces inaccessible from Python?
- Can I not access the Distributed Cache either?
- What about Store/Load functions?
Answer 1:
Python UDFs are quite limited. You cannot use Algebraic or Accumulator interfaces, nor can you write a LoadFunc in Python. For anything more complicated than a map operation you will likely need to resort to a Java UDF.
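As a rough illustration of the map-style case, here is a minimal sketch of the integer-flag workaround the question describes for Pig versions without a boolean type. The file name udfs.py, the function name is_long, and the length rule are all made up for the example. The fallback decorator is an assumption added so the file can also be run under plain Python; Pig supplies its own outputSchema when the script is registered with USING jython.

```python
# Hypothetical file name: udfs.py
# Pig makes the outputSchema decorator available when a script is registered
# with "USING jython". The fallback below is only for running this file
# under plain Python (for example, in local tests).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema
            return func
        return decorate


@outputSchema("keep:int")
def is_long(field):
    """Return 1 if field has more than 3 characters, else 0 (made-up rule)."""
    if field is None:
        return 0
    if len(field) > 3:
        return 1
    return 0

# Pig side (alias, field and kept are placeholders):
#   REGISTER 'udfs.py' USING jython AS py_udf;
#   kept = FILTER alias BY py_udf.is_long(field) > 0;
```

Returning an int and comparing it in the FILTER expression keeps the UDF a plain one-row-in, one-value-out function, which is the kind of operation Python UDFs handle well.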
That said, a more complex Python UDF with a dynamic outputSchema can be found at http://ragrawal.wordpress.com/2013/02/24/on-writing-python-udf-for-pig-a-perspective/. This likely won't help you, but it will give you a better understanding of what Python UDFs can do.
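For a sense of what a dynamic output schema looks like, here is a small sketch modelled on the square example in the Pig UDF documentation, not on the linked blog post. The names square and square_schema are illustrative, and the fallback decorators are an assumption added so the file runs under plain Python; under Pig the decorators come from the Jython script engine.

```python
# Fallbacks for running outside Pig; Pig provides these decorators itself
# when the script is registered with "USING jython".
try:
    outputSchemaFunction
    schemaFunction
except NameError:
    def outputSchemaFunction(schema_func_name):
        def decorate(func):
            func.outputSchemaFunction = schema_func_name
            return func
        return decorate

    def schemaFunction(schema_func_name):
        def decorate(func):
            func.schemaFunction = schema_func_name
            return func
        return decorate


# The output type depends on the input type (int in, int out; double in,
# double out), so the schema is delegated to a function instead of being
# written as a fixed string.
@outputSchemaFunction("square_schema")
def square(num):
    if num is None:
        return None
    return num * num


# Identity schema function: the output schema is the same as the input schema.
@schemaFunction("square_schema")
def square_schema(input_schema):
    return input_schema
```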
Answer 2:
This may not answer most of your specific questions, but this blog post and its linked code contain several good examples of using Pig with Python, including the use of Store/Load functions and how they interact with Python.
Source: https://stackoverflow.com/questions/10808838/python-udfs-in-pig