hadoop method to send output to multiple directories

Asked by 野趣味 on 2020-12-16 07:58

My MapReduce job processes data by date and needs to write its output to a certain folder structure. The current expectation is to generate output in the following structure:

2 Answers
  • 2020-12-16 08:20

    You should not need a second job. I am currently using MultipleOutputs to create a large number of output directories in one of my programs. Despite there being upwards of 30 directories, I am able to use only a couple of MultipleOutputs objects, because you can set the output directory when you write, so it can be determined at write time. You only actually need more than one named output if you want to write in different formats (e.g. one with key Text.class and value Text.class, and one with key Text.class and value IntWritable.class).
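
    (To illustrate that last point with a sketch that is not from the original answer: registering two named outputs with different value types in the driver. The names "text" and "counts" are made up for the example.)

    MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, "counts", TextOutputFormat.class, Text.class, IntWritable.class);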

    setup:

    MultipleOutputs.addNamedOutput(job, "Output", TextOutputFormat.class, Text.class, Text.class);
    
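    One caveat not mentioned in the original answer: when every record goes through MultipleOutputs, the job's default part-r-* files are still created, just empty. A common companion call in the driver, sketched here under the assumption that your job output format is TextOutputFormat, is LazyOutputFormat, which only creates the default output if something is actually written to it:

    // org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat
    // create the standard job output lazily, so empty part files are not left behind
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
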

    setup of reducer:

    mout = new MultipleOutputs<Text, Text>(context);
    

    calling mout in reducer:

    String key;            // set to whatever the output key will be
    String value;          // set to whatever the output value will be
    String outputFileName; // path (absolute, or relative to the job output directory) to write to

    mout.write("Output", new Text(key), new Text(value), outputFileName);
    

    You can have a piece of code determine the directory at write time. For example, say you want to organize output by year and month:

    int year;            // extract year from the data
    int month;           // extract month from the data
    String baseFileName; // parent directory for all outputs from this job
    String outputFileName = baseFileName + "/" + year + "/" + month;

    mout.write("Output", new Text(key), new Text(value), outputFileName);
    

    Hope this helps.

    EDIT: output directory structure for the above example (each leaf is a file-name prefix; MultipleOutputs appends a suffix such as -r-00000 to it):

    Base
        2013
            01
            02
            03
            ...
        2012
            01
            ...
        ...
    
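    Putting the fragments above together, here is a minimal end-to-end reducer sketch. The class name, the "base" literal, and the extract* helpers are illustrative assumptions, not part of the original answer:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class DateBucketReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mout;

        @Override
        protected void setup(Context context) {
            mout = new MultipleOutputs<Text, Text>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // how year and month are pulled out of a record is application-specific
                int year = extractYear(value);   // hypothetical helper
                int month = extractMonth(value); // hypothetical helper
                String outputFileName = "base/" + year + "/" + month;
                mout.write("Output", key, value, outputFileName);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mout.close(); // flushes and closes every writer MultipleOutputs opened
        }

        private int extractYear(Text value)  { return 2013; } // placeholder
        private int extractMonth(Text value) { return 1; }    // placeholder
    }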
  • 2020-12-16 08:39

    Most probably you forgot to close the mos in cleanup().

    If you have a setup in your mapper or reducer like the one below:

    public void setup(Context context) {
        mos = new MultipleOutputs(context);
    }
    

    you should close mos at the start of your cleanup, like below. MultipleOutputs opens its own record writers, and close() is what flushes them, so skipping it typically leaves those outputs empty or missing.

    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
    