Pass directories not files to hadoop-streaming?

Posted by 泄露秘密 on 2019-12-09 10:49:05

Question


In my job, I have the need to parse many historical logsets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date. For example:

  • logs/Customer_One/2011-01-02-001
  • logs/Customer_One/2012-02-03-001
  • logs/Customer_One/2012-02-03-002
  • logs/Customer_Two/2009-03-03-001
  • logs/Customer_Two/2009-03-03-002

Each individual log set may itself be five or six levels deep and contain thousands of files.

Therefore, I actually want the individual map jobs to handle walking the subdirectories: simply enumerating individual files is part of my distributed computing problem!

Unfortunately, when I try passing a directory containing only log subdirectories to Hadoop, it complains that I can't pass those subdirectories to my mapper. (Again, my mapper is written to accept subdirectories as input):

$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input file:///mnt/logs/Customer_Name/ -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output .

[ . . . ]

12/04/10 12:48:35 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.IOException: Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
12/04/10 12:48:35 ERROR streaming.StreamJob: Error Launching job : Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
Streaming Command Failed!
[cloudera@localhost ~]$

Is there a straightforward way to convince Hadoop-streaming to permit me to assign directories as work items?


Answer 1:


I guess you need to investigate writing a custom InputFormat to which you can pass the root directory. It would create a split for each customer, and the record reader for each split would then do the directory walk and push the file contents to your mappers.
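
Something along these lines might work. This is a rough, untested sketch, not the asker's code: the class names CustomerDirInputFormat and DirWalkRecordReader are made up, it targets the old org.apache.hadoop.mapred API that streaming's -inputformat option expects, it assumes a Hadoop 2.x-era FileStatus.isDirectory(), and for simplicity it emits one file path per record rather than the file contents, leaving the actual reading and parsing to mapper.sh.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CustomerDirInputFormat extends FileInputFormat<LongWritable, Text> {

  // One split per immediate child directory of the input path (one per customer).
  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (Path root : getInputPaths(job)) {
      FileSystem fs = root.getFileSystem(job);
      for (FileStatus customer : fs.listStatus(root)) {
        if (customer.isDirectory()) {
          splits.add(new FileSplit(customer.getPath(), 0, 0, new String[0]));
        }
      }
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new DirWalkRecordReader((FileSplit) split, job);
  }

  // Walks the customer directory recursively and emits one record per file path;
  // streaming then feeds "key<TAB>path" lines to the mapper script.
  static class DirWalkRecordReader implements RecordReader<LongWritable, Text> {
    private final List<Path> files = new ArrayList<Path>();
    private int pos = 0;

    DirWalkRecordReader(FileSplit split, JobConf job) throws IOException {
      FileSystem fs = split.getPath().getFileSystem(job);
      collect(fs, split.getPath());
    }

    private void collect(FileSystem fs, Path dir) throws IOException {
      for (FileStatus stat : fs.listStatus(dir)) {
        if (stat.isDirectory()) {
          collect(fs, stat.getPath());
        } else {
          files.add(stat.getPath());
        }
      }
    }

    public boolean next(LongWritable key, Text value) {
      if (pos >= files.size()) return false;
      key.set(pos);
      value.set(files.get(pos).toString());
      pos++;
      return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue() { return new Text(); }
    public long getPos() { return pos; }
    public float getProgress() { return files.isEmpty() ? 1f : (float) pos / files.size(); }
    public void close() {}
  }
}

You would then compile this into a jar and point streaming at it, roughly via -libjars your-format.jar -inputformat CustomerDirInputFormat, and have mapper.sh open each path it receives.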




Answer 2:


Hadoop allows input paths to contain glob-style wildcards (not full regular expressions). I haven't experimented with complex patterns, but the simple placeholders ? and * do work.

So in your case, I think it will work if you use the following as your input path:

file:///mnt/logs/Customer_Name/*/*

The last asterisk might not be needed, as all the files in the final directory are automatically added as input paths.
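
Under the hood, Hadoop expands such patterns with FileSystem.globStatus when it builds the input splits. A minimal sketch (not part of the original answer; the path is the asker's example) for previewing what a given glob will actually match:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobPreview {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The same glob that would be passed to -input.
    Path pattern = new Path("file:///mnt/logs/Customer_Name/*/*");
    FileSystem fs = pattern.getFileSystem(conf);
    FileStatus[] matches = fs.globStatus(pattern);
    if (matches == null) {
      System.out.println("No matches");
      return;
    }
    for (FileStatus status : matches) {
      System.out.println(status.getPath());
    }
  }
}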



Source: https://stackoverflow.com/questions/10095717/pass-directories-not-files-to-hadoop-streaming
