How to sort numerically in Hadoop's shuffle/sort phase?

醉话见心 2020-12-13 16:05

The data looks like this; the first field of each line is a number:

3 ...
1 ...
2 ...
11 ...

I want to sort these lines numerically by the first field.
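
To make the problem concrete, here is a minimal Python sketch, assuming the shuffle compares the keys as plain text (the usual default for Hadoop Streaming). It is not Hadoop code; it only contrasts text ordering with numeric ordering on the sample first fields above.

```python
# First fields from the sample data; the rest of each line is omitted.
keys = ["3", "1", "2", "11"]

# Compared as text, "11" sorts before "2" because "1" < "2" character by character.
as_text = sorted(keys)

# Compared by numeric value, the keys come out in the order the question asks for.
as_numbers = sorted(keys, key=int)

print(as_text)     # ['1', '11', '2', '3']
print(as_numbers)  # ['1', '2', '3', '11']
```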

3 Answers
  •  南方客 (original poster)
    2020-12-13 16:29

    For streaming with older Hadoop versions (which may use -jobconf instead of -D for configuration), you can sort by key:

    -jobconf stream.num.map.output.key.fields=2 \
    -jobconf mapreduce.partition.keycomparator.options="-k2,2nr" \
    -jobconf mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
    

    Setting stream.num.map.output.key.fields=2 makes the first two columns of each map output line the key, so the 1st column is key field 1 and the 2nd column is key field 2.

    mapreduce.partition.keycomparator.options="-k2,2nr" sorts on the 2nd key field only (-k2,2 means the sort key starts and ends at field 2), compares it as a number (n), and reverses the order (r), so the largest values come first.

    It works much like the Linux sort command!
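
As a rough illustration of the ordering that "-k2,2nr" asks for, here is a minimal Python sketch. It only emulates the comparison on a few hypothetical map output lines and is not how Hadoop implements it. The sample lines and the helper name second_field_as_number are made up for the example, and the fields are assumed to be tab-separated.

```python
# Hypothetical map output lines: two tab-separated key fields, then a value.
lines = [
    "a\t3\tx",
    "b\t11\ty",
    "c\t2\tz",
    "d\t1\tw",
]


def second_field_as_number(line: str) -> float:
    """Return the 2nd tab-separated field as a number (the '-k2,2' and 'n' parts)."""
    return float(line.split("\t")[1])


# reverse=True plays the role of the 'r' flag: largest values first.
ordered = sorted(lines, key=second_field_as_number, reverse=True)

for line in ordered:
    print(line)
```

Dropping reverse=True corresponds to "-k2,2n", an ascending numeric sort on the same field.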
