Hadoop: Provide a directory as input to a MapReduce job

自闭症患者 2020-12-16 04:11

I'm using Cloudera Hadoop. I'm able to run a simple MapReduce program where I provide a file as input to the MapReduce program.

This file contains all the other files to be processed.

4 Answers
  • 2020-12-16 04:26

    You can use FileSystem.listStatus to get the file list from the given directory; the code could look like this:

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    // get the FileSystem; "conf" is the job's JobConf (old mapred API), initialized properly
    FileSystem fs = FileSystem.get(conf);
    // get the FileStatus list for the given input directory
    FileStatus[] statusList = fs.listStatus(new Path(args[0]));
    if (statusList != null) {
        for (FileStatus status : statusList) {
            // add each file as an input path for the MapReduce job
            FileInputFormat.addInputPath(conf, status.getPath());
        }
    }
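
    Note that listStatus also returns subdirectories, so for nested layouts you may want to skip (or recurse into) those entries before adding them. A minimal sketch of that filter, assuming the same variables as above:

    for (FileStatus status : statusList) {
        // skip subdirectories; add only plain files as job inputs
        if (!status.isDirectory()) {
            FileInputFormat.addInputPath(conf, status.getPath());
        }
    }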
    
  • 2020-12-16 04:40

    Use the MultipleInputs class.

    MultipleInputs.addInputPath(Job job, Path path,
            Class<? extends InputFormat> inputFormatClass,
            Class<? extends Mapper> mapperClass)
    

    Have a look at the working code.
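
    As a rough sketch (assuming the new mapreduce API; FirstMapper and SecondMapper are hypothetical mapper classes used only for illustration), each input path can be bound to its own mapper and input format:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // "job" is the Job being configured; FirstMapper and SecondMapper are
    // placeholder mapper classes for this sketch.
    MultipleInputs.addInputPath(job, new Path("/input/dir1"),
            TextInputFormat.class, FirstMapper.class);
    MultipleInputs.addInputPath(job, new Path("/input/dir2"),
            TextInputFormat.class, SecondMapper.class);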

  • 2020-12-16 04:45

    The problem is that FileInputFormat does not read files recursively from the input directory.

    Solution: enable recursive traversal of the input directory before adding the input path in your MapReduce driver code:

    // read nested files under the input directory recursively
    FileInputFormat.setInputDirRecursive(job, true);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    You can check here to see in which Hadoop version this was fixed.
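
    If you would rather not call the setter in code, the same switch can, as far as I know, also be flipped through the mapreduce.input.fileinputformat.input.dir.recursive configuration property (assumed here to be the Hadoop 2.x property backing setInputDirRecursive):

    // assumed Hadoop 2.x property behind setInputDirRecursive(job, true)
    job.getConfiguration().setBoolean(
            "mapreduce.input.fileinputformat.input.dir.recursive", true);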

  • 2020-12-16 04:45

    You can use HDFS wildcards (globs) to provide multiple files as input.

    So, the solution is:

    hadoop jar ABC.jar /folder1/* /output
    

    or

    hadoop jar ABC.jar /folder1/*.txt /output
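
    If the input is configured in the driver rather than on the command line, FileInputFormat should accept the same glob pattern directly in the path. A small sketch, assuming the new mapreduce API and a Job named job:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // the glob is expanded when the job lists its input files,
    // so this picks up every .txt file under /folder1
    FileInputFormat.addInputPath(job, new Path("/folder1/*.txt"));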
    