Question:
I have a 4-core desktop and want to use all of my cores for local data processing with Hadoop (i.e. sometimes I have enough power to process the data locally, and sometimes I submit the same jobs to a cluster).
By default Hadoop's local mode runs only one mapper and one reducer, so my local jobs are really slow. I do not want to set up a cluster on a single machine, first because of the "painful" configuration and second because I would have to create a jar each time. So the perfect solution would be to run embedded Hadoop on a single machine.
PS: pseudo-distributed mode is a bad option, since it will create a cluster with a single node, so I will still get only one mapper and I will have to spend some time on additional configuration.
Answer 1:
You need to use MultithreadedMapRunner - just set it up via JobConf's setMapRunnerClass method, and don't forget to set mapred.map.multithreadedrunner.threads to the desired concurrency level.
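A minimal sketch of that old-API (org.apache.hadoop.mapred) setup, where MyMapper, MyReducer and the "input"/"output" paths are placeholders for your own classes and paths:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

JobConf conf = new JobConf();
conf.setJobName("multithreaded-local");
// Run each map task's map() calls on a pool of threads instead of a single thread
conf.setMapRunnerClass(MultithreadedMapRunner.class);
conf.setInt("mapred.map.multithreadedrunner.threads", 4); // one thread per core
conf.setMapperClass(MyMapper.class);    // placeholder; must be thread safe
conf.setReducerClass(MyReducer.class);  // placeholder
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
JobClient.runJob(conf);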
Also, there is another way (sketched in code below). You should:
- set MultithreadedMapper as your mapper class on the Job-typed object,
- call MultithreadedMapper.setMapperClass with your actual mapper class,
- call MultithreadedMapper.setNumberOfThreads with the desired concurrency level.
But be careful: your mapper class must be thread safe, and its setup and cleanup methods will be called several times, so it isn't a smart idea to mix MultithreadedMapper with MultipleOutputs unless you implement your own MultithreadedMapper-inspired class.
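A minimal sketch of those steps with the new (org.apache.hadoop.mapreduce) API, where MyMapper is a placeholder for your actual, thread-safe mapper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

Configuration conf = new Configuration();
Job job = new Job(conf, "multithreaded-local");
// MultithreadedMapper is the mapper Hadoop sees; it delegates to MyMapper on a thread pool
job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, MyMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 4); // one thread per core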
Answer 2:
Hadoop purposely does not run more than one task at the same time in one JVM, for isolation purposes. And in stand-alone (local) mode, only one JVM is ever used. If you want to make use of your four cores, you should run in pseudo-distributed mode and increase the maximum number of concurrent tasks to four. You can do this with the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties.
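For example, in a Hadoop 1.x pseudo-distributed setup those per-TaskTracker limits would go into mapred-site.xml on the single node (a sketch, not a complete configuration):

<!-- mapred-site.xml: allow up to 4 map and 4 reduce tasks to run at once -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>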
Answer 3:
Configuration conf = new Configuration();
Job job = new Job(conf, "SolerRandomHit");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// MultithreadedMapper is only a wrapper; it still needs the real mapper and a thread count
job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, SolerMapper.class); // placeholder for your actual mapper
MultithreadedMapper.setNumberOfThreads(job, 4);
Source: https://stackoverflow.com/questions/12504690/how-to-run-hadoop-multithread-way-in-single-jvm