How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job?
I am creating a program to analyze PDF, DOC, and DOCX files stored in HDFS. When I start my MapReduce job, I want the map function to receive the filename as the key and the file's binary contents as the value. I then want to create a stream reader that I can pass to the PDF parser library.

How can I make the key/value pair for the map phase filename/file contents?

I am using Hadoop 0.20.2.

This is older code that starts a job:

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PdfReader.class);
        conf.setJobName("pdfreader");
        conf.setOutputKeyClass(Text