Hadoop: Provide a directory as input to a MapReduce job

自闭症患者 2020-12-16 04:11

I'm using Cloudera Hadoop. I'm able to run a simple MapReduce program where I provide a file as input to the MapReduce program.

This file contains all the other files to be processed.

4 Answers
  • 2020-12-16 04:26

    You can use FileSystem.listStatus to get the file list from the given directory; the code could look like this:

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    // get the FileSystem; "conf" is the job's JobConf (old mapred API), initialized properly
    FileSystem fs = FileSystem.get(conf);
    // get the FileStatus list for the given input directory
    FileStatus[] statusList = fs.listStatus(new Path(args[0]));
    if (statusList != null) {
        for (FileStatus status : statusList) {
            // add each file as an input path for the MapReduce job
            FileInputFormat.addInputPath(conf, status.getPath());
        }
    }
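
    Note that listStatus also returns subdirectories, so for nested layouts you may want to skip (or recurse into) those entries before adding them. A minimal sketch of that filter, assuming the same variables as above:

    for (FileStatus status : statusList) {
        // skip subdirectories; add only plain files as job inputs
        if (!status.isDirectory()) {
            FileInputFormat.addInputPath(conf, status.getPath());
        }
    }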
    
  • 2020-12-16 04:40

    Use the MultipleInputs class.

    MultipleInputs.addInputPath(Job job, Path path,
            Class<? extends InputFormat> inputFormatClass,
            Class<? extends Mapper> mapperClass)
    

    Have a look at the working code.
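
    As a rough sketch (assuming the new mapreduce API; FirstMapper and SecondMapper are hypothetical mapper classes used only for illustration), each input path can be bound to its own mapper and input format:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // "job" is the Job being configured; FirstMapper and SecondMapper are
    // placeholder mapper classes for this sketch.
    MultipleInputs.addInputPath(job, new Path("/input/dir1"),
            TextInputFormat.class, FirstMapper.class);
    MultipleInputs.addInputPath(job, new Path("/input/dir2"),
            TextInputFormat.class, SecondMapper.class);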

  • 2020-12-16 04:45

    The problem is that FileInputFormat does not read files recursively from the input directory.

    Solution: enable recursive traversal of the input directory before adding the input path in your MapReduce driver code:

    // read nested files under the input directory recursively
    FileInputFormat.setInputDirRecursive(job, true);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    You can check here to see in which Hadoop version this was fixed.
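
    If you would rather not call the setter in code, the same switch can, as far as I know, also be flipped through the mapreduce.input.fileinputformat.input.dir.recursive configuration property (assumed here to be the Hadoop 2.x property backing setInputDirRecursive):

    // assumed Hadoop 2.x property behind setInputDirRecursive(job, true)
    job.getConfiguration().setBoolean(
            "mapreduce.input.fileinputformat.input.dir.recursive", true);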

  • 2020-12-16 04:45

    You can use HDFS wildcards (globs) to provide multiple files as input.

    So, the solution is:

    hadoop jar ABC.jar /folder1/* /output
    

    or

    hadoop jar ABC.jar /folder1/*.txt /output
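
    If the input is configured in the driver rather than on the command line, FileInputFormat should accept the same glob pattern directly in the path. A small sketch, assuming the new mapreduce API and a Job named job:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // the glob is expanded when the job lists its input files,
    // so this picks up every .txt file under /folder1
    FileInputFormat.addInputPath(job, new Path("/folder1/*.txt"));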
    