Question:
I have a 4-core desktop and want to use all of my cores for local data processing with Hadoop (i.e. sometimes I have enough power to process the data locally, and sometimes I submit the same jobs to a cluster).
By default Hadoop's local mode runs only one mapper and one reducer, so my local jobs are really slow. I do not want to set up a cluster on a single machine, first because of the "painful" configuration and second because I would have to create a jar each time. So the perfect solution would be to run embedded Hadoop on a single machine.
PS: pseudo-distributed mode is a bad option, since it will create a cluster with a single node, so I will still get only one mapper and I will have to spend some time on additional configuration.
Answer 1:
You need to use MultithreadedMapRunner - just set it up via JobConf's setMapRunnerClass method, and don't forget to set mapred.map.multithreadedrunner.threads to the desired concurrency level.
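A minimal sketch of that old-API (org.apache.hadoop.mapred) setup, where MyMapper, MyReducer and the "input"/"output" paths are placeholders for your own classes and paths:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

JobConf conf = new JobConf();
conf.setJobName("multithreaded-local");
// Run each map task's map() calls on a pool of threads instead of a single thread
conf.setMapRunnerClass(MultithreadedMapRunner.class);
conf.setInt("mapred.map.multithreadedrunner.threads", 4); // one thread per core
conf.setMapperClass(MyMapper.class);    // placeholder; must be thread safe
conf.setReducerClass(MyReducer.class);  // placeholder
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
JobClient.runJob(conf);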
Also, there is another way (sketched in code below). You should:
- set MultithreadedMapper as your mapper class on the Job-typed object,
- call MultithreadedMapper.setMapperClass with your actual mapper class,
- call MultithreadedMapper.setNumberOfThreads with the desired concurrency level.
But be careful: your mapper class must be thread safe, and its setup and cleanup methods will be called several times, so it isn't a smart idea to mix MultithreadedMapper with MultipleOutputs unless you implement your own MultithreadedMapper-inspired class.
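A minimal sketch of those steps with the new (org.apache.hadoop.mapreduce) API, where MyMapper is a placeholder for your actual, thread-safe mapper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

Configuration conf = new Configuration();
Job job = new Job(conf, "multithreaded-local");
// MultithreadedMapper is the mapper Hadoop sees; it delegates to MyMapper on a thread pool
job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, MyMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 4); // one thread per core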
Answer 2:
Hadoop purposely does not run more than one task at the same time in one JVM, for isolation purposes. And in stand-alone (local) mode, only one JVM is ever used. If you want to make use of your four cores, you should run in pseudo-distributed mode and increase the maximum number of concurrent tasks to four. You can do this with the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties.
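For example, in a Hadoop 1.x pseudo-distributed setup those per-TaskTracker limits would go into mapred-site.xml on the single node (a sketch, not a complete configuration):

<!-- mapred-site.xml: allow up to 4 map and 4 reduce tasks to run at once -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>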
Answer 3:
Configuration conf = new Configuration();
Job job = new Job(conf, "SolerRandomHit");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// MultithreadedMapper is only a wrapper; it still needs the real mapper and a thread count
job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, SolerMapper.class); // placeholder for your actual mapper
MultithreadedMapper.setNumberOfThreads(job, 4);
Source: https://stackoverflow.com/questions/12504690/how-to-run-hadoop-multithread-way-in-single-jvm