Pass directories, not files, to hadoop-streaming?
In my job, I need to parse many historical log sets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date. For example:

    logs/Customer_One/2011-01-02-001
    logs/Customer_One/2012-02-03-001
    logs/Customer_One/2012-02-03-002
    logs/Customer_Two/2009-03-03-001
    logs/Customer_Two/2009-03-03-002

Each individual log set may itself be five or six levels deep and contain thousands of files. Therefore, I actually want the individual map jobs to handle walking the subdirectories: simply enumerating individual files is part of my distributed computing problem.
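To illustrate the approach I have in mind, here is a minimal sketch of a streaming mapper that receives directory paths (one per input line) and walks each subtree itself, emitting one record per file. This is a hypothetical example, not code from my job: it assumes the paths are reachable through a locally visible filesystem (e.g. an NFS or FUSE mount); against raw HDFS you would have to list files via the HDFS API or by shelling out to `hadoop fs -ls -R` instead of using `os.walk`.

```python
#!/usr/bin/env python3
"""Sketch of a hadoop-streaming mapper whose input records are
directory paths rather than file contents. Each mapper walks its
assigned subtree, so file enumeration is distributed across mappers."""
import os
import sys


def emit_files(root):
    """Yield every regular file under root, however deeply nested."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        root = line.strip()
        if not root:
            continue
        for path in emit_files(root):
            # key = the directory this mapper was assigned,
            # value = one file found beneath it
            stdout.write(f"{root}\t{path}\n")


if __name__ == "__main__":
    main()
```

The driver would then feed a small input file containing one directory path per line (e.g. `logs/Customer_One`, `logs/Customer_Two`), so each map task gets a handful of customer directories to enumerate rather than a pre-expanded list of thousands of files.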