Access HDFS file from UDF

Posted by 笑着哭i on 2019-12-21 02:46:11

Question


I'd like to access a file from my UDF call. This is my script:

files = LOAD '$docs_in' USING PigStorage(';') AS (id, stopwords, id2, file);
buzz = FOREACH files GENERATE pigbuzz.Buzz(file, id) as file:bag{(year:chararray, word:chararray, count:long)}; 

The jar is registered. The path is relative to my HDFS, where the files really exist. The UDF is called, but it seems the file is not found, maybe because I'm trying to access a file on HDFS.

How can I access a file in HDFS from my Java UDF?


Answer 1:


Inside an EvalFunc you can open a file on HDFS via:

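// Build the FileSystem from the job configuration so the path resolves against the cluster's default file system (HDFS)
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
FSDataInputStream in = fs.open(new Path(fileName));
BufferedReader br = new BufferedReader(new InputStreamReader(in));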
....

You might also consider putting the files into the distributed cache. In that case you have to override getCacheFiles() in your EvalFunc class.

For example:

@Override
public List<String> getCacheFiles() {
  List<String> list = new ArrayList<String>(2);
  list.add("/cache/pig/wordlist1.txt#w1");
  list.add("/cache/pig/wordlist2.txt#w2");
  return list;
}

Then you can pass the symlink names of the files (w1 and w2) as the file names and read them from the local file system of each worker node:

BufferedReader br = new BufferedReader(new FileReader(fileName));
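Since a UDF is called once per input tuple, it is probably worth reading the cached file only once and keeping its contents in memory. The following is a minimal sketch of that idea in plain Java, not part of the Pig API: the class name CachedWordList and its methods are hypothetical, and it assumes the file holds one word per line and that the symlink (for example w1) is visible in the task's working directory.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Pig: loads a word list on first use so
// that the file is not reopened for every input tuple.
final class CachedWordList {
    private final String symlinkName; // e.g. "w1", the name after '#' in getCacheFiles()
    private Set<String> words;        // filled lazily on the first lookup

    CachedWordList(String symlinkName) {
        this.symlinkName = symlinkName;
    }

    // Reads one word per line and skips blank lines (assumed file layout).
    static Set<String> readWords(Reader source) {
        Set<String> result = new HashSet<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    result.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    // Returns true if the word is in the list; loads the file on first call.
    boolean contains(String word) {
        if (words == null) {
            try {
                // The symlink resolves to a local copy on the worker node.
                words = readWords(new FileReader(symlinkName));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return words.contains(word);
    }
}
```

In an EvalFunc, an instance such as new CachedWordList("w1") could be kept in a field and queried from exec(), so the local file is read on the first tuple only.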


Source: https://stackoverflow.com/questions/17514022/access-hdfs-file-from-udf
