MultipleTextOutputFormat alternative in new API

[愿得一人] 2020-12-10 16:11

As it stands, MultipleTextOutputFormat has not been migrated to the new API. So if we need to choose an output directory and output filename based on the key-value being

3 answers
  •  一生所求
    2020-12-10 16:39

    I'm using AWS EMR Hadoop 1.0.3, and it is possible to specify different directories and files based on k/v pairs. Use either of the following functions from the MultipleOutputs class:

    public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)
    

    or

    public <K, V> void write(String namedOutput, K key, V value,
                             String baseOutputPath)
    

    The former write method requires the key to be the same type as the map output key (in case you are using this in the mapper) or the same type as the reduce output key (in case you are using this in the reducer). The value must also be typed in similar fashion.

    The latter write method requires the key/value types to match the types specified when you set up the MultipleOutputs static properties using the addNamedOutput function:

    public static void addNamedOutput(Job job,
                                      String namedOutput,
                                      Class<? extends OutputFormat> outputFormatClass,
                                      Class<?> keyClass,
                                      Class<?> valueClass)
    

    So if you need different output types than the Context is using, you must use the latter write method.

    The trick to getting different output directories is to pass a baseOutputPath that contains a directory separator, like this:

    multipleOutputs.write("output1", key, value, "dir1/part");
    

    In my case, this created files named "dir1/part-r-00000".
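The file name reported above follows a simple pattern: the baseOutputPath, then the task type, then a zero-padded partition number. The sketch below only imitates that pattern in plain Java so it can be checked without a cluster. The class and method names (OutputNameSketch, uniqueFile) are hypothetical and are not part of Hadoop, and the 'm' prefix for map-side output is an assumption based on the usual part-m-NNNNN naming.

```java
import java.util.Locale;

// Hypothetical helper (not a Hadoop class): imitates the observed naming
// pattern "<baseOutputPath>-<taskType>-<5-digit partition><extension>".
final class OutputNameSketch {

    // taskType: 'r' for reducer output; 'm' for mapper output is an assumption.
    static String uniqueFile(String baseOutputPath, char taskType,
                             int partition, String extension) {
        return String.format(Locale.ROOT, "%s-%c-%05d%s",
                baseOutputPath, taskType, partition, extension);
    }

    public static void main(String[] args) {
        // The case from the answer: base path "dir1/part", first reducer.
        System.out.println(uniqueFile("dir1/part", 'r', 0, ""));
    }
}
```

Because the directory separator is part of the baseOutputPath itself, everything before the last "/" ends up as a subdirectory of the job output directory.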

    I was not successful in using a baseOutputPath that contains the .. directory, so all baseOutputPaths are strictly contained in the path passed to the -output parameter.
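Since the path has to stay inside the job output directory, one option is to derive the baseOutputPath from the key and sanitize it first. This is a minimal sketch under assumptions: BasePathSketch and baseOutputPathFor are hypothetical names, and the rule of replacing every character outside letters, digits, '-' and '_' with '_' is just one possible policy. It has the side effect that "." and "/" in a key can never produce a ".." segment or an extra directory level.

```java
// Hypothetical helper (not part of Hadoop): turns a record key into a
// baseOutputPath of the form "<sanitized key>/<filePrefix>".
final class BasePathSketch {

    static String baseOutputPathFor(String key, String filePrefix) {
        // Keep only letters, digits, '-' and '_'; everything else,
        // including '.' and '/', becomes '_'.
        String dir = key.replaceAll("[^A-Za-z0-9_-]", "_");
        if (dir.isEmpty()) {
            throw new IllegalArgumentException("key yields an empty directory name");
        }
        return dir + "/" + filePrefix;
    }

    public static void main(String[] args) {
        System.out.println(baseOutputPathFor("2020-12-10", "part"));
        System.out.println(baseOutputPathFor("../etc", "part"));
    }
}
```

In a mapper or reducer, the returned string would be passed as the last argument of either write method, for example multipleOutputs.write("output1", key, value, BasePathSketch.baseOutputPathFor(key.toString(), "part")).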

    For more details on how to set up and properly use MultipleOutputs, see this code I found (not mine, but I found it very helpful; it does not use different output directories): https://github.com/rystsov/learning-hadoop/blob/master/src/main/java/com/twitter/rystsov/mr/MultipulOutputExample.java
