Hadoop - how do map-reduce tasks know which part of a file to handle?


Question


I've been starting to learn Hadoop, and currently I'm trying to process log files that are not too well structured - in that the value I normally use for the M/R key is typically found at the top of the file (once). So basically my mapping function takes that value as the key and then scans the rest of the file to aggregate the values that need to be reduced. So a [fake] log might look like this:

## log.1
SOME-KEY
2012-01-01 10:00:01 100
2012-01-02 08:48:56 250
2012-01-03 11:01:56 212
.... many more rows

## log.2
A-DIFFERENT-KEY
2012-01-01 10:05:01 111
2012-01-02 16:46:20 241
2012-01-03 11:01:56 287
.... many more rows

## log.3
SOME-KEY
2012-02-01 09:54:01 16
2012-02-02 05:53:56 333
2012-02-03 16:53:40 208
.... many more rows

I want to accumulate the 3rd column for each key. I have a cluster of several nodes running this job, so I'm concerned about several issues:

1. File Distribution

Given that Hadoop's HDFS works in 64MB blocks (by default), and every file is distributed over the cluster, can I be sure that the correct key will be matched against the proper numbers? That is, if the block containing the key is on one node, and a block containing data for that same key (a different part of the same log) is on a different machine - how does the M/R framework match the two (if at all)?

2. Block Assignment

For text logs such as the ones described, how is each block's cutoff point decided? Is it after a row ends, or exactly at 64MB (binary)? Does it even matter? This relates to my #1, where my concern is that the proper values are matched with the correct keys over the entire cluster.

3. File structure

What is the optimal file structure (if any) for M/R processing? I'd probably be far less worried if a typical log looked like this:

A-DIFFERENT-KEY 2012-01-01 10:05:01 111
SOME-KEY        2012-01-02 16:46:20 241
SOME-KEY        2012-01-03 11:01:56 287
A-DIFFERENT-KEY 2012-02-01 09:54:01 16
A-DIFFERENT-KEY 2012-02-02 05:53:56 333
A-DIFFERENT-KEY 2012-02-03 16:53:40 208
...

However, the logs are huge and it would be very costly (in time) to convert them to the above format. Should I be concerned?
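For what it's worth, with that layout every record is self-contained on a single line, so (if I understand correctly) the default line-by-line splitting would be safe and the mapper would be trivial. A rough sketch of what I mean, assuming TextInputFormat and the org.apache.hadoop.mapreduce API (the class name PerLineMapper is made up):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: each line already carries its own key, so the mapper
// never needs to look outside the current line (or block).
public class PerLineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] f = line.toString().trim().split("\\s+");
        // f[0] = key, f[1] = date, f[2] = time, f[3] = value
        if (f.length >= 4) {
            context.write(new Text(f[0]), new LongWritable(Long.parseLong(f[3])));
        }
    }
}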

4. Job Distribution

Are the jobs assigned such that only a single JobClient handles an entire file? Or rather, how are the keys/values coordinated between all the JobClients? Again, I'm trying to guarantee that my shady log structure still yields correct results.


Answer 1:


Given that Hadoop's HDFS works in 64MB blocks (by default), and every file is distributed over the cluster, can I be sure that the correct key will be matched against the proper numbers? That is, if the block containing the key is on one node, and a block containing data for that same key (a different part of the same log) is on a different machine - how does the M/R framework match the two (if at all)?

How the keys and the values are mapped depends on the InputFormat class. Hadoop ships with several InputFormat classes, and custom InputFormat classes can also be defined.

If FileInputFormat is used (for plain text, its subclass TextInputFormat), then the key passed to the mapper is the byte offset of the line within the file and the value is the line itself. In most cases the offset is ignored and only the line is processed by the mapper. So, by default, each line in the log file will be a value to the mapper.
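A minimal sketch of that contract, assuming TextInputFormat and the org.apache.hadoop.mapreduce API (the class name OffsetAwareMapper is made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the default contract: each map() call receives the line's
// byte offset as the key and the line text as the value.
public class OffsetAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The offset is rarely useful; most mappers only look at the line.
        context.write(new Text("offset:" + offset.get()), line);
    }
}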

There may be cases, as in the OP's logs, where related data is split across blocks; each block is then processed by a different mapper, and Hadoop cannot relate them. One way around this is to let a single mapper process the complete file by overriding the FileInputFormat#isSplitable method to return false. This is not an efficient approach if the files are very large. A rough sketch of the idea follows.
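In this sketch the class names WholeFileTextInputFormat and LogMapper are made up, and it assumes (per the OP's example) that the key sits on the very first line of each file:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Prevent splitting so each log file is read by exactly one mapper
// (only reasonable while individual files stay fairly small).
class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

// Because the whole file goes to one mapper, the mapper can remember the
// header line (the key) and emit (key, thirdColumn) for every data line.
class LogMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private String logKey;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().trim().split("\\s+");
        if (offset.get() == 0) {
            logKey = fields[0];                       // first line, e.g. SOME-KEY
        } else if (fields.length >= 3) {
            long value = Long.parseLong(fields[2]);   // third column
            context.write(new Text(logKey), new LongWritable(value));
        }
    }
}

The driver would then select this input format with job.setInputFormatClass(WholeFileTextInputFormat.class), and something like org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer could accumulate the emitted values per key.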

For text logs such as the ones described, how is each block's cutoff point decided? Is it after a row ends, or exactly at 64MB (binary)? Does it even matter? This relates to my #1, where my concern is that the proper values are matched with the correct keys over the entire cluster.

Each block in HDFS is by default exactly 64MB, unless the file is smaller than 64MB or the default block size has been modified; record boundaries are not considered when blocks are cut. Part of a line can end up in one block and the rest in the next. Hadoop's record readers do understand record boundaries, however, so even if a record (line) is split across blocks it will still be processed by a single mapper only; this may require some data transfer from the next block.

Are the jobs assigned such that only a single JobClient handles an entire file? Or rather, how are the keys/values coordinated between all the JobClients? Again, I'm trying to guarantee that my shady log structure still yields correct results.

It's not exactly clear what the question is here. I'd suggest going through some MapReduce tutorials and coming back with specific questions.



Source: https://stackoverflow.com/questions/8894902/hadoop-how-are-map-reduce-tasks-know-which-part-of-a-file-to-handle
