How does MapReduce read from multiple input files?


In order to get the input file path you can use the context object, like this:

// Cast to FileSplit works for FileInputFormat-based input formats
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String inputFilePath = fileSplit.getPath().toString();
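For context, here is a minimal sketch of how this might sit inside a complete Mapper; the class name and the choice to emit the path as the output key are illustrative, not part of the original answer:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PathAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String inputFilePath;

    @Override
    protected void setup(Context context) {
        // The cast assumes a FileInputFormat-based input format
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        inputFilePath = fileSplit.getPath().toString();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the source file path with each record, so downstream
        // stages can tell which input file a value came from
        context.write(new Text(inputFilePath), value);
    }
}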

As for how multiple files are processed:

Several instances of the mapper are created on different machines in the cluster. Each instance receives a different input split. If a file is bigger than the default HDFS block size (128 MB), it is split into smaller parts, which are then distributed across mappers.
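To feed several input files (or directories) to the same job, the driver adds each path to the job's input; a rough sketch follows, where the paths and class names are placeholders (PathAwareMapper refers to the illustrative mapper above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiFileDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "multi-file example");
        job.setJarByClass(MultiFileDriver.class);
        job.setMapperClass(PathAwareMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Each call adds one more file or directory to the job's input;
        // every file is broken into splits and distributed to mappers
        FileInputFormat.addInputPath(job, new Path("/data/input1"));
        FileInputFormat.addInputPath(job, new Path("/data/input2"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}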

So you can configure the input size received by each mapper in the following two ways (a configuration sketch follows the list):

  • change the HDFS block size (e.g. dfs.block.size=1048576)
  • set the parameter mapred.min.split.size (this can only be set to a value larger than the HDFS block size)
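As a rough sketch, either knob can be set on the job's Configuration before submission. The 256 MB value is arbitrary; note also that these are the old property names, and newer Hadoop releases use dfs.blocksize and mapreduce.input.fileinputformat.split.minsize instead:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();

// Raise the minimum split size so each mapper receives at least 256 MB
conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);

// Alternatively, change the block size; this only affects files
// written to HDFS after the change, not existing files
conf.setLong("dfs.block.size", 256L * 1024 * 1024);

Job job = Job.getInstance(conf, "split-size example");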

Note: These parameters are only effective if your input format supports splitting the input files. Common compression codecs (such as gzip) don't support splitting, so for such files these settings are ignored.
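If you are unsure whether a given file is splittable, one way to check (a sketch; the path and variable names are illustrative) is to ask Hadoop's codec factory, which picks a codec from the file extension:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

Configuration conf = new Configuration();
Path path = new Path("/data/input1/part-0000.gz");

// A null codec means the file is uncompressed and therefore splittable;
// otherwise only codecs implementing SplittableCompressionCodec
// (such as bzip2) allow the file to be split
CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
boolean splittable = (codec == null) || (codec instanceof SplittableCompressionCodec);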

In continuation of @Amar's answer, I used a FileStatus object in the following code, as my customised input format would not split the input file.

// 'file' is the Path of the current input file and 'conf' the job
// Configuration (e.g. inside a custom InputFormat or RecordReader)
FileSystem fs = file.getFileSystem(conf);
FileStatus status = fs.getFileStatus(file);
String fileName = status.getPath().toString();