This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines. And for the sake of simplicity, let's consider that each line is of
HDFS splits the file into blocks when it is written; there is no separate MapReduce job for that, and the block size is controlled by dfs.blocksize (128 MB by default in Hadoop 2 and later). When a MapReduce job runs, the InputFormat computes logical input splits over those blocks: use a FileInputFormat subclass (e.g. TextInputFormat) for large files and CombineFileInputFormat when you have many small files. You can check whether an input can be split at all via the isSplitable() method. Each split is then processed by a map task, which the framework tries to schedule on a node that already holds the corresponding block (data locality). The split size depends on the block size together with mapred.max.split.size and mapred.min.split.size (mapreduce.input.fileinputformat.split.maxsize / .minsize in the new API); the framework uses max(minSize, min(maxSize, blockSize)).
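Here is a minimal sketch (using the new org.apache.hadoop.mapreduce API) of how split sizes are typically configured on a job; the input path, job name, and size values are just placeholders for illustration, and the mapper/reducer setup is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitConfigExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "line-analysis"); // hypothetical job name

            // The logical split size is bounded by these two settings together
            // with the HDFS block size: max(minSize, min(maxSize, blockSize)).
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB

            // TextInputFormat is a FileInputFormat; swap in CombineFileInputFormat
            // (via a concrete subclass) when the input is many small files.
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/data/big-file.txt")); // placeholder path

            // Mapper/reducer classes would be set here; this sketch only shows split configuration.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

If you need to prevent splitting entirely (for example for a non-splittable compression codec), you would subclass the input format and override isSplitable() to return false.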