How to sort numerically in Hadoop's shuffle/sort phase?

醉话见心 2020-12-13 16:05

The data looks like this; the first field is a number:

3 ...
1 ...
2 ...
11 ...

I want to sort these lines by the number in the first field.

3 answers
  •  北海茫月
    2020-12-13 16:23

    Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.

    1. Add -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator to the streaming command.

    2. Specify the kind of sorting you need with mapred.text.key.comparator.options. Two useful options are -n (numeric sort) and -r (reverse sort).
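
To make the effect of -n and -r concrete, here is a minimal Python sketch that mimics the three orderings locally. It is only an illustration and not Hadoop code: the helper name `sort_keys` and its parameters are hypothetical, and it assumes every key is a plain integer string.

```python
def sort_keys(keys, numeric=False, reverse=False):
    """Order keys as a text, numeric (-n) or reversed (-r) comparison would.

    Hypothetical helper for illustration only; with numeric=True it assumes
    every key parses as an integer.
    """
    key_func = int if numeric else None  # None keeps plain string comparison
    return sorted(keys, key=key_func, reverse=reverse)


keys = ["3", "1", "2", "11"]
print(sort_keys(keys))                              # text order: "11" sorts before "2"
print(sort_keys(keys, numeric=True))                # like -n
print(sort_keys(keys, numeric=True, reverse=True))  # like -nr
```

With the data from the question, the text ordering puts 11 between 1 and 2, which is exactly the behavior the comparator options are meant to change.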

    EXAMPLE:

    Create an identity mapper and an identity reducer. Both mapper.py and reducer.py contain the same code:

    #!/usr/bin/env python
    import sys

    # Identity step: echo each input line, stripped of surrounding whitespace.
    for line in sys.stdin:
        print(line.strip())
    

    This is input.txt:

    1
    11
    2
    20
    7
    3
    40
    

    This is the streaming command:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
        -D mapred.text.key.comparator.options=-n \
        -input /user/input.txt \
        -output /user/output.txt \
        -file ~/mapper.py \
        -mapper ~/mapper.py \
        -file ~/reducer.py \
        -reducer ~/reducer.py
    

    This produces the required output:

    1   
    2   
    3   
    7   
    11  
    20  
    40
    

    NOTE:

    1. I have used a simple single-key input. If you have multiple key fields and/or partitions, you will have to adjust mapred.text.key.comparator.options as needed. Since I do not know your use case, my example is limited to this.

    2. The identity mapper is needed because an MR job needs at least one mapper to run.

    3. The identity reducer is needed because the shuffle/sort phase does not take place in a pure map-only job.
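
For the multi-field case in note 1, the comparator options take per-field specs in the style of Unix sort, for example something like -k1,1n -k2,2nr (first field numeric ascending, second field numeric descending). The Python sketch below only emulates that ordering locally so the effect is visible. The names `parse_key` and `sort_records` are hypothetical, the tab separator and the two-integer-field layout are assumptions, and none of this is Hadoop API.

```python
def parse_key(line, sep="\t"):
    """Split a record into two integer key fields (assumed record layout)."""
    first, second = line.split(sep)[:2]
    return int(first), int(second)


def sort_records(lines, sep="\t"):
    """Sort by field 1 ascending, then field 2 descending, both numerically."""
    def order(line):
        first, second = parse_key(line, sep)
        return (first, -second)  # negating field 2 yields descending order
    return sorted(lines, key=order)


records = ["2\t5", "11\t1", "2\t40", "1\t7"]
for record in sort_records(records):
    print(record)
```

Records that share the first field (the two starting with 2) are ordered by the second field from largest to smallest, which is what a reversed numeric spec on the second field is meant to do.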
