Question
In Hadoop, the mapper receives the key as the position in the file, like "0, 23, 45, 76, 123", which I think are byte offsets.
I have two large input files that I need to split in such a way that the same regions (in terms of number of lines, e.g. 400 lines) of each file get the same key. Byte offsets are clearly not the best option for that.
I was wondering if there is a way or option to change the keys to integers so the output keys will be "1, 2, 3, 4, 5" instead of "0, 23, 45, 76, 123"?
Thank you!
Answer 1:
In Hadoop, the mapper receives the key as the position in the file, like "0, 23, 45, 76, 123", which I think are byte offsets.
Yes, but not always. That is true if you are using TextInputFormat (as in your case). The keys and values depend on the type of InputFormat you are using and change accordingly.
I was wondering if there is a way or option to change the keys to integers so the output keys will be "1, 2, 3, 4, 5" instead of "0, 23, 45, 76, 123"?
You can write your own custom InputFormat by subclassing FileInputFormat to achieve this.
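A minimal sketch of that idea (my own, assuming the org.apache.hadoop.mapreduce API): a hypothetical LineNumberInputFormat that wraps the standard LineRecordReader and replaces the byte-offset key with a running line counter, so the keys become 1, 2, 3, ... Note that the counter restarts for every split; the next answer discusses that caveat.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical example class, not part of Hadoop.
public class LineNumberInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new LineNumberRecordReader();
    }

    // Delegates reading to LineRecordReader but returns a line counter
    // (starting at 1 in each split) instead of the byte offset.
    public static class LineNumberRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader reader = new LineRecordReader();
        private final LongWritable lineNo = new LongWritable(0);

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            reader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (!reader.nextKeyValue()) {
                return false;
            }
            lineNo.set(lineNo.get() + 1);
            return true;
        }

        @Override
        public LongWritable getCurrentKey() {
            return lineNo;
        }

        @Override
        public Text getCurrentValue() {
            return reader.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException {
            return reader.getProgress();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }
}

You would register it with job.setInputFormatClass(LineNumberInputFormat.class).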
Answer 2:
You can track the line number yourself in the mapper:
protected int recNo = 0;

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    ++recNo;  // recNo is the 1-based record number within this mapper's split
    // mapper implementation
    // ...
}
But this doesn't account for splittable files (a file that is stored in two or more blocks and is splittable because, for example, it isn't gzip-compressed). In that case every split will number its lines from 1, rather than continuing the line count from the beginning of the file. You mention that you have two large files, so you'll either need to force the input format's minimum split size to be larger than the size of the files, or compress your files with a non-splittable compression codec such as gzip (to force a single map task per file); see the sketch below.
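A minimal sketch of the first option (my own, not from the answer); the 10 GB value is an assumption and only needs to exceed your largest input file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SingleSplitPerFileDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "line-numbering");
        // Raise the minimum split size above the largest input file (10 GB assumed here)
        // so each file is read by a single map task and recNo stays a global line number.
        FileInputFormat.setMinInputSplitSize(job, 10L * 1024L * 1024L * 1024L);
        // ... set mapper, input/output paths, etc., then submit the job
    }
}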
Answer 3:
That is possible. If I understand you correctly, you want to index all records in increasing order.
I have done that. You can take advantage of the framework, similar to how we index work items in GPU programming. The idea: split the file into splits with the same number of records each, which allows you to compute the global index of any particular record. After the file is split, the formula is
actualIndex = splitNumber * numRecordsPerSplit + recordOffset
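For example, with 400 records per split (the figure from the question) and zero-based split numbers, the 57th record of split 3 gets actualIndex = 3 * 400 + 57 = 1257.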
Now let's go into detail.
First, create splits with NLineInputFormat, which gives every split the same number of records and therefore lets you index records within a particular split. Emit each record with a key composed of splitId + recordIndexInSplit, and the actual record as the value. Now the records are indexed within their split after the map phase.
Then you need a custom sort comparator that sorts the intermediate output by the splitId in the key, and a custom grouping comparator that groups all keys with the same splitId.
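A minimal sketch of that wiring, under my own assumptions: SplitIdSortComparator and SplitIdGroupingComparator are hypothetical classes implementing the ordering and grouping just described (not shown here), and 400 lines per split is taken from the question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class RecordIndexingDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "record-indexing");

        // Every split gets the same number of records.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 400);

        // Hypothetical comparators: sort intermediate keys by splitId,
        // and group all keys with the same splitId.
        job.setSortComparatorClass(SplitIdSortComparator.class);
        job.setGroupingComparatorClass(SplitIdGroupingComparator.class);

        // ... set mapper, reducer, key/value types, paths, then job.waitForCompletion(true)
    }
}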
Now, in the reducer, you can use the formula above to index the records.
The remaining problem is how to identify the split numbers in ascending order. I solved it as follows.
Hadoop names each split as file_HDFS_URL/file_name:startOffset+length,
for example: hdfs://server:8020/file.txt:0+400, hdfs://server:8020/file.txt:400+700, and so on.
I created a file in HDFS that records the start offsets of all splits, and then used it in the reducer. This way you get fully parallel record indexing.
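A minimal sketch, assuming that NLineInputFormat hands the mapper a FileSplit (it does in stock Hadoop), of reading the split's start offset and file name inside the mapper's setup(); mapping the offset to a split number via the HDFS file of offsets is left as described above.

// Inside the Mapper subclass; requires org.apache.hadoop.mapreduce.lib.input.FileSplit.
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    FileSplit split = (FileSplit) context.getInputSplit();
    long splitStartOffset = split.getStart();      // e.g. 0, 400, 800, ...
    String fileName = split.getPath().getName();   // e.g. file.txt
    // Look splitStartOffset up in the HDFS file of start offsets to get the split number.
}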
Source: https://stackoverflow.com/questions/17835954/hadoop-sort-the-key-and-change-the-key-value