I know we can set the property "mapred.job.reuse.jvm.num.tasks" to reuse the JVM. My questions are:
(1) how to decide the number of tasks to be set here, -1 or some o
JVM reuse (only possible in MR1) should help performance because it removes the JVM startup lag, but the gain is marginal and comes with a number of drawbacks (read: side effects). Most tasks run for a long time (tens of seconds or even minutes), so startup time is not the problem when you compare it with those task run times. You would rather start each new task on a clean slate. When you reuse a JVM, there is a chance that the heap is not completely clean (it is fragmented from the previous runs). That fragmentation can lead to more GCs and nullify all the startup-time gains. If a task leaks memory, the leak carries over and affects the memory usage of later tasks in the same JVM. So it is better to start a new JVM for each task, unless the tasks are reasonably small. In MR2 (YARN), a new JVM is always started for each task. The exception is uber tasks, which run in the local JVM only (that of the ApplicationMaster).
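
To put rough numbers on the argument above, here is a minimal sketch in plain Java (no Hadoop dependency). It estimates what share of a task's wall-clock time goes to JVM startup, and how many JVM launches a given reuse setting implies. In MR1, a value of 1 for mapred.job.reuse.jvm.num.tasks (the default) means no reuse and -1 means no limit on the tasks per JVM. The class and method names, the 1-second startup cost, and the "tasks run back to back in one slot" model are all assumptions made for illustration, not measurements or Hadoop APIs.

```java
public class JvmReuseEstimate {

    // Share of a task's wall-clock time spent on JVM startup
    // when the task gets a fresh JVM.
    static double startupOverheadFraction(double startupSeconds, double taskSeconds) {
        return startupSeconds / (startupSeconds + taskSeconds);
    }

    // JVM launches needed to run `tasks` tasks back to back in one slot
    // (a simplifying assumption). reuseNumTasks mirrors the meaning of
    // mapred.job.reuse.jvm.num.tasks: 1 = no reuse, -1 = no limit,
    // n = up to n tasks per JVM.
    static long jvmLaunches(long tasks, int reuseNumTasks) {
        if (tasks <= 0) {
            return 0;
        }
        if (reuseNumTasks == -1) {
            return 1; // one JVM serves every task in the slot
        }
        if (reuseNumTasks < 1) {
            // Rejecting other values is this sketch's own choice.
            throw new IllegalArgumentException("expected -1 or a positive value");
        }
        return (tasks + reuseNumTasks - 1) / reuseNumTasks; // ceiling division
    }

    public static void main(String[] args) {
        double startupSeconds = 1.0; // hypothetical JVM startup cost
        double[] taskSeconds = {2.0, 30.0, 300.0};
        for (double t : taskSeconds) {
            System.out.printf("task of %.0f s: startup is %.1f%% of wall time%n",
                    t, 100.0 * startupOverheadFraction(startupSeconds, t));
        }

        long tasks = 100;
        for (int reuse : new int[] {1, 10, -1}) {
            System.out.printf("reuse=%d: %d JVM launches for %d tasks%n",
                    reuse, jvmLaunches(tasks, reuse), tasks);
        }
    }
}
```

Under these assumptions the startup share only becomes significant for tasks that last a few seconds, which is the case where reuse is worth considering. The sketch does not model the GC and memory-leak costs described above, so it overstates the benefit of reuse if anything.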