hadoop/yarn and task parallelization on non-hdfs filesystems
I've instantiated a Hadoop 2.4.1 cluster and I've found that running MapReduce applications will parallelize differently depending on what kind of filesystem the input data is on. Using HDFS, a MapReduce job will spawn enough containers to maximize use of all available memory. For example, a 3-node cluster with 172GB of memory with each map task allocating 2GB, about 86 application containers will be created. On a filesystem that isn't HDFS (like NFS or in my use case, a parallel filesystem), a MapReduce job will only allocate a subset of available tasks (e.g., with the same 3-node cluster,