Use global variable in reducer class

Submitted by £可爱£侵袭症+ on 2019-12-03 05:06:12

Template code could look something like this (the Reducer is not shown, but it follows the same principle):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolExample extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        Configuration conf = job.getConfiguration();

        // values set on the job configuration are shipped to every map/reduce task
        conf.set("strProp", "value");
        conf.setInt("intProp", 123);
        conf.setBoolean("boolProp", true);

        // rest of your config here
        // ..

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        private String strProp;
        private int intProp;
        private boolean boolProp;

        @Override
        protected void setup(Context context) throws IOException,
                InterruptedException {
            // setup() runs once per task, before any calls to map()
            Configuration conf = context.getConfiguration();

            strProp = conf.get("strProp");
            intProp = conf.getInt("intProp", -1);
            boolProp = conf.getBoolean("boolProp", false);
        }
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ToolExample(), args));
    }
}
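For completeness, the reducer side reads the properties the same way in its setup() method. Here is a minimal sketch mirroring the mapper above (it would sit alongside MyMapper inside ToolExample and additionally needs an import of org.apache.hadoop.mapreduce.Reducer):

    public static class MyReducer extends
            Reducer<LongWritable, Text, LongWritable, Text> {
        private String strProp;
        private int intProp;
        private boolean boolProp;

        @Override
        protected void setup(Context context) throws IOException,
                InterruptedException {
            // same idea as in the mapper: read the values from the job configuration
            Configuration conf = context.getConfiguration();
            strProp = conf.get("strProp");
            intProp = conf.getInt("intProp", -1);
            boolProp = conf.getBoolean("boolProp", false);
        }
    }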

In a cluster environment (as opposed to local mode), each map and reduce task runs in its own JVM when the map/reduce code is written in Java (or in a separate process for other languages). Because of this, you cannot declare a static variable in a class, change it during the MapReduce flow, and expect to see the updated value in another JVM. What you need is some form of shared state that both the mapper and the reducer can set and get.

There are a few ways to achieve this.

  1. As Chris mentioned, use the Configuration set()/get() methods to pass values to the mapper and/or the reducer, as the template code above does. In this case, you must set the values on the Configuration object before the job is submitted.

  2. Write your data to a file on HDFS and read it back from the mapper and/or the reducer (a sketch follows below). Remember to clean up that HDFS file once the job is done.
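A minimal sketch of option 2, assuming the standard org.apache.hadoop.fs.FileSystem API and a placeholder path of /tmp/shared-value.txt (it also needs imports for Path, FSDataInputStream and FSDataOutputStream):

// driver side, before submitting the job: write the shared value to HDFS
FileSystem fs = FileSystem.get(job.getConfiguration());
Path shared = new Path("/tmp/shared-value.txt"); // placeholder path
try (FSDataOutputStream out = fs.create(shared, true)) {
    out.writeUTF("some shared value");
}

// mapper/reducer side, e.g. in setup(): read the value back
try (FSDataInputStream in = FileSystem.get(context.getConfiguration()).open(shared)) {
    String sharedValue = in.readUTF();
}

// driver side, after the job has finished: clean up
fs.delete(shared, false);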

User-defined Hadoop Counters are another kind of global variable. Their values can be viewed after the job has finished. For example, if you want to count the number of erroneous/good records across your input (which is processed by various mappers/reducers), counters are the way to go. @Mo: you can use counters for your requirement.
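For example, a counter can be incremented from any task and read back in the driver after the job completes (the group and counter names below are just illustrative):

// inside map() or reduce(): increment a user-defined counter
context.getCounter("MyCounters", "BAD_RECORDS").increment(1);

// in the driver, after job.waitForCompletion(true) returns:
long badRecords = job.getCounters()
        .findCounter("MyCounters", "BAD_RECORDS")
        .getValue();
System.out.println("Bad records: " + badRecords);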
