I am going through Hadoop: The Definitive Guide, where it clearly explains input splits. It goes like this:
Input splits don't contain actual data; rather, they reference the storage locations of the data.
The HDFS block size is an exact number, but the input split size is based on our data logic, so it may differ slightly from the configured number.
Input splits are logical data units that are fed to each mapper. Data is split along valid record boundaries. Input splits contain addresses of blocks and byte offsets.
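To make that concrete, here is a minimal sketch (class name is hypothetical, and it assumes a FileInputFormat-based job) of a mapper peeking at the split it was handed in setup(): the FileSplit object carries only a file path, a byte offset, a length and host locations, never the data itself.
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitInfoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void setup(Context context) {
        // The split is only a description: path + byte offset + length, no data.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("file=" + split.getPath()
                + " start=" + split.getStart()
                + " length=" + split.getLength());
    }
}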
Let's say, you have a text file that spanned across 4 blocks.
File:
a b c d
e f g h
i j k l
m n o p
Blocks:
block1: a b c d e
block2: f g h i j
block3: k l m n o
block4: p
Splits:
Split1: a b c d e f g h
Split2: i j k l m n o p
Observe that the splits align with record boundaries in the file. Now, each split is fed to a mapper.
If the input split size is less than the block size, you will end up using more mappers, and vice versa.
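To see how records are kept whole, here is a small plain-Java sketch (no Hadoop dependencies, illustrative only) that mimics how a line-oriented reader treats fixed-size byte ranges: a reader whose range does not start at byte 0 skips the leading partial line, and it reads past the end of its range to finish the line it is in the middle of, so every record lands in exactly one split.
public class SplitAlignmentDemo {
    public static void main(String[] args) {
        String file = "a b c d\ne f g h\ni j k l\nm n o p\n";
        int rangeSize = 12; // pretend block size in bytes; cuts lines mid-record

        for (int start = 0; start < file.length(); start += rangeSize) {
            int end = Math.min(start + rangeSize, file.length());

            // A reader not starting at byte 0 skips its leading partial line.
            int pos = start;
            if (start != 0) {
                int nl = file.indexOf('\n', start);
                pos = (nl == -1) ? file.length() : nl + 1;
            }

            // Keep reading whole lines until we have passed the range end.
            StringBuilder records = new StringBuilder();
            while (pos <= end && pos < file.length()) {
                int nl = file.indexOf('\n', pos);
                int lineEnd = (nl == -1) ? file.length() : nl + 1;
                records.append('[').append(file.substring(pos, lineEnd).trim()).append(']');
                pos = lineEnd;
            }
            System.out.println("bytes [" + start + "," + end + ") -> records " + records);
        }
    }
}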
Hope that helps.
A block is the physical representation of data. A split is the logical representation of the data present in a block.
Block and split sizes can be changed through configuration properties.
The map reads data from a block through splits, i.e. a split acts as a broker between the block and the mapper.
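For the split-size properties, here is a minimal sketch (the byte values are just illustrative) of setting per-job split size bounds through FileInputFormat; the HDFS block size itself is a property of the file when it is written (dfs.blocksize), not of the job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");
        // Lower and upper bounds (in bytes) used when FileInputFormat computes splits.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024); // 128 MB
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
    }
}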
Consider two blocks:
Block 1
aa bb cc dd ee ff gg hh ii jj
Block 2
ww ee yy uu oo ii oo pp kk ll nn
Now the map reads block 1 from aa to jj, but it doesn't know how to read block 2, i.e. a block doesn't know how to process a different block of information. Here a split comes in: it forms a logical grouping of block 1 and block 2 as a single block, then it forms offset (key) and line (value) pairs using the InputFormat and record reader, and sends them to the map for further processing.
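As a rough sketch of that key/value shape (the class name is hypothetical): with the default TextInputFormat, the record reader hands each map call the byte offset of a line as the key and the line itself as the value, no matter which physical block those bytes came from.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OffsetLineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line within the file, value = the line itself
        context.write(offset, line);
    }
}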
If your resources are limited and you want to limit the number of maps, you can increase the split size. For example: if we have 640 MB in 10 blocks, i.e. each block is 64 MB, and resources are limited, then you can set the split size to 128 MB; a logical grouping of 128 MB is formed and only 5 maps are executed, each processing 128 MB.
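A small sketch of that arithmetic (the max/min formula follows what FileInputFormat does, to the best of my knowledge):
public class SplitCountDemo {
    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long fileSize  = 640L * 1024 * 1024; // 640 MB file
        long blockSize = 64L  * 1024 * 1024; // 10 blocks of 64 MB
        long minSize   = 128L * 1024 * 1024; // raise the minimum split size to 128 MB
        long maxSize   = Long.MAX_VALUE;

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long numMaps   = (fileSize + splitSize - 1) / splitSize; // ceiling division
        System.out.println(splitSize / (1024 * 1024) + " MB per split, "
                + numMaps + " map tasks"); // prints: 128 MB per split, 5 map tasks
    }
}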
If we disable splitting, the whole file forms one input split and is processed by one map, which takes more time when the file is big.
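One way to get that one-split-per-file behaviour deliberately is to override isSplitable; a hedged sketch (the class name is made up):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Every input file becomes exactly one split, and therefore one map task.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}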
The answer by @user1668782 is a great explanation for the question and I'll try to give a graphical depiction of it.
Assume we have a file of 400 MB that consists of 4 records (e.g. a 400 MB CSV file with 4 rows, 100 MB each).
Because HDFS blocks are cut at fixed byte boundaries, the block boundaries will not line up with the 100 MB records, so a map working on a single block would see incomplete records. This is the exact problem that input splits solve: input splits respect logical record boundaries.
Let's assume the input split size is 200 MB.
Therefore, input split 1 should have both record 1 and record 2. Input split 2 will not start with record 2, since record 2 has already been assigned to input split 1; input split 2 will start with record 3.
This is why an input split is only a logical chunk of data. It points to start and end locations within blocks.
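If you want to see those start/length descriptions for yourself, here is a rough driver sketch (the path argument and job name are placeholders) that just prints the splits TextInputFormat would produce for a given input path:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PrintSplits {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "print-splits");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Each split prints as roughly path:start+length - a description, not data.
        for (InputSplit split : new TextInputFormat().getSplits(job)) {
            System.out.println(split);
        }
    }
}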
Hope this helps.
The Hadoop framework's strength is its data locality. So whenever a client requests HDFS data, the framework always checks for locality first; otherwise it looks for a node with low I/O utilization.
To 1) and 2): I'm not 100% sure, but if the task cannot complete - for whatever reason, including if something is wrong with the input split - then it is terminated and another one is started in its place. So each map task gets exactly one split with file info (you can quickly tell if this is the case by debugging against a local cluster to see what information is held in the input split object: I seem to recall it's just the one location).
To 3): if the file format is splittable, then Hadoop will attempt to cut the file down to "inputSplit"-size chunks; if not, then it's one task per file, regardless of the file size. If you increase the minimum input split size, you can prevent too many mapper tasks from being spawned when each of your input files is divided at the block size, but you can only combine inputs if you use something like CombineFileInputFormat (I think that's what it's called).
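For that combining case, a hedged sketch (the 256 MB cap and class name CombineSmallFiles are just illustrative) using CombineTextInputFormat, which packs many small files into fewer splits and therefore fewer map tasks:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineSmallFiles {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        // Pack several small files into one split, up to 256 MB per split.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        // ... the rest of the job setup (mapper, input/output paths) goes here
    }
}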