In my job, I need to parse many historical log sets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date. For example:
- logs/Customer_One/2011-01-02-001
- logs/Customer_One/2012-02-03-001
- logs/Customer_One/2012-02-03-002
- logs/Customer_Two/2009-03-03-001
- logs/Customer_Two/2009-03-03-002
Each individual log set may itself be five or six levels deep and contain thousands of files.
Therefore, I actually want the individual map jobs to handle walking the subdirectories: simply enumerating individual files is part of my distributed computing problem!
Unfortunately, when I try passing a directory containing only log subdirectories to Hadoop, it complains that I can't pass those subdirectories to my mapper. (Again, my mapper is written to accept subdirectories as input):
$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input file:///mnt/logs/Customer_Name/ -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output .
[ . . . ]
12/04/10 12:48:35 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.IOException: Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
12/04/10 12:48:35 ERROR streaming.StreamJob: Error Launching job : Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
Streaming Command Failed!
[cloudera@localhost ~]$
Is there a straightforward way to convince Hadoop-streaming to permit me to assign directories as work items?
I guess you need to investigate writing a custom InputFormat to which you can pass the root directory. It will create a split for each customer, and the record reader for each split will then do the directory walk and push the file contents to your mappers.
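A rough sketch of that idea against the old org.apache.hadoop.mapred API (the one streaming's -inputformat flag expects). Everything here is illustrative and untested: the class name DirInputFormat is made up, and instead of pushing file contents I have the record reader emit one file path per record, leaving the actual reading to the mapper (pushing the contents would work the same way, just with more logic in next()):

    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative only: one split per immediate subdirectory of each input path.
    public class DirInputFormat implements InputFormat<Text, Text> {

        public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (Path root : FileInputFormat.getInputPaths(job)) {
                FileSystem fs = root.getFileSystem(job);
                for (FileStatus stat : fs.listStatus(root)) {
                    if (stat.isDir()) {
                        // Offset/length are dummies; the split is really just the directory.
                        splits.add(new FileSplit(stat.getPath(), 0, 0, new String[0]));
                    }
                }
            }
            return splits.toArray(new InputSplit[splits.size()]);
        }

        public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
                Reporter reporter) throws IOException {
            Path dir = ((FileSplit) split).getPath();
            FileSystem fs = dir.getFileSystem(job);

            // Walk the split's directory tree up front, queueing every file found.
            final Deque<Path> files = new ArrayDeque<Path>();
            Deque<Path> pending = new ArrayDeque<Path>();
            pending.push(dir);
            while (!pending.isEmpty()) {
                for (FileStatus stat : fs.listStatus(pending.pop())) {
                    if (stat.isDir()) pending.push(stat.getPath());
                    else files.add(stat.getPath());
                }
            }

            // Emit one record per file; streaming writes "key<TAB>value" lines to
            // the mapper's stdin, so the mapper sees one file path per line.
            return new RecordReader<Text, Text>() {
                public boolean next(Text key, Text value) {
                    if (files.isEmpty()) return false;
                    key.set(files.poll().toString());
                    value.set("");
                    return true;
                }
                public Text createKey() { return new Text(); }
                public Text createValue() { return new Text(); }
                public long getPos() { return 0; }
                public float getProgress() { return files.isEmpty() ? 1.0f : 0.0f; }
                public void close() {}
            };
        }
    }

You would compile this into a jar and pass something like -inputformat DirInputFormat (plus -libjars for your jar) to the streaming command; each mapper then receives the file paths for exactly one subdirectory on stdin.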
Hadoop supports glob patterns in input paths (they are globs rather than full regular expressions). I haven't experimented with complex patterns, but the simple wildcards ? and * do work.
So in your case, I think the following input path will work:
file:///mnt/logs/Customer_Name/*/*
The last asterisk might not be needed, as all the files in the final directory are automatically added as input paths.
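For example, the streaming invocation from the question would become (untested, but following the same form):

$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input "file:///mnt/logs/Customer_Name/*/*" -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output .

Note the quotes around the -input argument: they keep the local shell from expanding the wildcards before Hadoop sees them.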
Source: https://stackoverflow.com/questions/10095717/pass-directories-not-files-to-hadoop-streaming