Question
I am using a small map file in my Java UDF, and I want to pass the name of this file from Pig through the UDF's constructor.
Here is the relevant part of my UDF:
public GenerateXML() throws IOException {
    // No-argument constructor: delegate with no map file
    this(null);
}

public GenerateXML(String mapFilename) throws IOException {
    if (mapFilename != null) {
        // do processing
    }
}
In the Pig script I have the following line:
DEFINE GenerateXML com.domain.GenerateXML('typemap.tsv');
This works in local mode, but not in distributed mode. I am passing the following parameters to Pig on the command line:
pig -Dmapred.cache.files="/path/to/typemap.tsv#typemap.tsv" -Dmapred.create.symlink=yes -f generate-xml.pig
And I am getting the following exception:
2013-01-11 10:39:42,002 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<file generate-xml.pig, line 16, column 42> Failed to generate logical plan. Nested exception: java.lang.RuntimeException: could not instantiate 'com.domain.GenerateXML' with arguments '[typemap.tsv]'
Any idea what I need to change to make it work?
Answer 1:
The problem is solved now.
It seems that when I run the Pig script with the following parameters
pig -Dmapred.cache.files="/path/to/typemap.tsv#typemap.tsv" -Dmapred.create.symlink=yes -f generate-xml.pig
/path/to/typemap.tsv should be a local path and not a path in HDFS.
Answer 2:
You can override the getCacheFiles function in a Pig UDF, and that is enough: you don't have to use any additional properties like mapred.cache.files. Your case can be implemented like this:
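import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Dictionary and createSomething are placeholders for your own lookup type and logic.
public class UdfCacheExample extends EvalFunc<Tuple> {
    private Dictionary dictionary;
    private final String pathToDictionary;

    public UdfCacheExample(String pathToDictionary) {
        // Only store the path here; do not read the file in the constructor.
        this.pathToDictionary = pathToDictionary;
    }

    @Override
    public Tuple exec(Tuple input) throws IOException {
        Dictionary dict = getDictionary();
        return createSomething(input, dict);
    }

    @Override
    public List<String> getCacheFiles() {
        // Files listed here are shipped to the tasks through the distributed cache.
        return Arrays.asList(pathToDictionary);
    }

    private Dictionary getDictionary() throws IOException {
        if (dictionary == null) {
            // lazy initialization here
        }
        return dictionary;
    }
}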
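The "lazy initialization" step above is left open, so here is a minimal sketch of one way it could look, written against the plain JDK so it can be tried without Pig. The error in the question is raised while Pig builds the logical plan, which suggests the constructor runs before any task has the cached file available; for that reason the sketch does no I/O in the constructor and reads the file on the first lookup. The class name LazyTsvMap, its methods, and the two-column tab-separated layout of the map file are assumptions made for illustration, not part of Pig or of the original UDF.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Pig class): loads a two-column TSV file on first use.
final class LazyTsvMap {
    private final String fileName;       // e.g. the symlink name "typemap.tsv"
    private Map<String, String> entries; // stays null until the first lookup

    LazyTsvMap(String fileName) {
        // Only remember the name. No I/O happens here, so construction
        // cannot fail on a machine where the file is absent.
        this.fileName = fileName;
    }

    boolean isLoaded() {
        return entries != null;
    }

    // Returns the value mapped to key, or null if the key is absent.
    String lookup(String key) throws IOException {
        if (entries == null) {
            entries = load(fileName);
        }
        return entries.get(key);
    }

    // Assumed file layout: one "key<TAB>value" pair per line.
    private static Map<String, String> load(String fileName) throws IOException {
        Map<String, String> result = new HashMap<>();
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get(fileName), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) { // skip lines without a tab
                    result.put(parts[0], parts[1]);
                }
            }
        }
        return result;
    }
}
```

In a UDF shaped like UdfCacheExample, getDictionary() could hold one such instance and call lookup from exec. The name passed to it would presumably be the symlink name after the # in the cache entry (typemap.tsv in the question), since that is the name under which the file appears in the task's working directory.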
Source: https://stackoverflow.com/questions/14276749/passing-a-filename-to-java-udf-from-pig-using-distributed-cache