Hadoop DistributedCache is deprecated - what is the preferred API?

情深已故 2020-11-28 04:14

My map tasks need some configuration data, which I would like to distribute via the Distributed Cache.

The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, but that class is now marked as deprecated. What is the preferred API to use instead?
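For context, the deprecated usage described in the tutorial looks roughly like the sketch below (the path and variable names are placeholders, not code from the tutorial itself):

    // Old, deprecated API (org.apache.hadoop.filecache.DistributedCache):
    // in the driver
    DistributedCache.addCacheFile(new URI("/user/yourname/cache/some_file.json"), conf);

    // in the Mapper, e.g. inside setup()
    Path[] cachedFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());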

6 Answers
  •  囚心锁ツ
    2020-11-28 05:18

    To expand on @jtravaglini's answer, the preferred way of using the distributed cache for YARN/MapReduce 2 is as follows:

    In your driver, use Job.addCacheFile():

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
    
        Job job = Job.getInstance(conf, "MyJob");
    
        job.setMapperClass(MyMapper.class);
    
        // ...
    
        // Mind the # sign after the absolute file location.
        // You will be using the name after the # sign as your
        // file name in your Mapper/Reducer
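        // (the files referenced here must already exist on a shared
        // filesystem such as HDFS before the job is submitted)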
        job.addCacheFile(new URI("/user/yourname/cache/some_file.json#some"));
        job.addCacheFile(new URI("/user/yourname/cache/other_file.json#other"));
    
        return job.waitForCompletion(true) ? 0 : 1;
    }
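
    The run() method in this pattern typically comes from the Tool interface, so the driver is launched through ToolRunner. A minimal sketch of that wiring, assuming Tool/ToolRunner (MyJobDriver is a placeholder name, not part of the answer above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJobDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // the run() body shown above goes here
            return 0;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner parses the generic Hadoop options (-D, -files, ...)
            // before handing the remaining arguments to run()
            System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
        }
    }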
    

    And in your Mapper/Reducer, override the setup(Context context) method:

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        if (context.getCacheFiles() != null
                && context.getCacheFiles().length > 0) {

            // "./some" and "./other" are the symlink names given after
            // the # sign in the URIs passed to job.addCacheFile(); they
            // appear in the task's local working directory.
            File some_file = new File("./some");
            File other_file = new File("./other");

            // Do things to these two files, like read them
            // or parse as JSON or whatever.
        }
        super.setup(context);
    }
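
    For example, reading one of the localized files inside setup() could look like the sketch below; it assumes plain-text content (adapt the parsing to your format) and needs java.io.BufferedReader and java.io.FileReader imported:

    // assumes the cached file is plain text; adjust parsing as needed
    try (BufferedReader reader = new BufferedReader(new FileReader("./some"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // e.g. build an in-memory lookup table used later by map()
        }
    }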
    
