What are some scenarios for which MPI is a better fit than MapReduce?

执笔经年 2020-12-22 17:51

As far as I understand, MPI gives me much more control over how exactly different nodes in the cluster will communicate.

In MapReduce/Hadoop, each node does some com

5 answers
  • 2020-12-22 18:08

    Although this question has already been answered, I would like to add (or reiterate) one very important point.

    MPI is best suited for problems that require a lot of interprocess communication.

    When the data becomes large (petabytes, anyone?) and there is little interprocess communication, MPI becomes a pain. The processes spend all their time sending data to each other, so bandwidth becomes the limiting factor and your CPUs sit idle. Perhaps an even bigger problem is reading all that data in the first place.

    This is the fundamental reason for having something like Hadoop. The data itself also has to be distributed, hence the Hadoop Distributed File System!

    In short, MPI is good for task parallelism and Hadoop is good for data parallelism.
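To make the data-parallel side concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps for a word count. It is plain Python, not Hadoop code; the function names and sample chunks are made up for illustration. The point it tries to show is that each map call only touches its own chunk, so data moves between workers only in the shuffle.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk; needs nothing from other chunks."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group values by key; the only step where data crosses chunk boundaries."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key independently of all other keys."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    return reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))


# Hypothetical input: each string stands in for a block stored on a different node.
chunks = ["MPI sends messages", "Hadoop moves code to data", "data data everywhere"]
counts = word_count(chunks)
```

In a real cluster the map calls would run where the blocks are stored, which is why the amount of data matters far less than the amount of communication.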

  • 2020-12-22 18:11

    I expect that MPI beats MapReduce easily when the task iterates over a data set whose size is comparable to the processor cache and communication with other tasks is required frequently. Lots of scientific domain-decomposition parallelization approaches fit this pattern. If MapReduce forces sequential processing and communication, or requires processes to end between steps, then the computational performance benefit of working on a cache-sized problem is lost.
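The domain-decomposition pattern can be sketched as follows. This is a single-process simulation in plain Python, not MPI code: a 1D Jacobi iteration is split into subdomains, and on every step each subdomain needs one boundary value (a "halo") from each neighbour. In an MPI program that exchange would be a send/receive pair between neighbouring ranks. The function names and grid are assumptions chosen for illustration.

```python
def jacobi_serial(u, steps):
    """Reference version: replace each interior point by the mean of its neighbours."""
    u = list(u)
    for _ in range(steps):
        new = list(u)
        for i in range(1, len(u) - 1):
            new[i] = 0.5 * (u[i - 1] + u[i + 1])
        u = new
    return u


def jacobi_decomposed(u, steps, parts):
    """Same update with the grid split into `parts` non-empty subdomains."""
    n = len(u)
    bounds = [(p * n // parts, (p + 1) * n // parts) for p in range(parts)]
    local = [list(u[lo:hi]) for lo, hi in bounds]
    for _ in range(steps):
        # Halo exchange: every subdomain reads one value from each neighbour.
        # None marks a missing neighbour, i.e. the global boundary.
        left_halo = [local[p - 1][-1] if p > 0 else None for p in range(parts)]
        right_halo = [local[p + 1][0] if p < parts - 1 else None for p in range(parts)]
        new_local = []
        for p, block in enumerate(local):
            padded = [left_halo[p]] + block + [right_halo[p]]
            new_block = list(block)
            for j in range(len(block)):
                left, right = padded[j], padded[j + 2]
                if left is None or right is None:
                    continue  # global boundary point stays fixed
                new_block[j] = 0.5 * (left + right)
            new_local.append(new_block)
        local = new_local
    return [x for block in local for x in block]


# Hypothetical grid: zero everywhere except a fixed value at the right end.
grid = [0.0] * 11 + [100.0]
serial = jacobi_serial(grid, 50)
decomposed = jacobi_decomposed(grid, 50, parts=3)
```

The amount of data exchanged per step is tiny, but it happens on every iteration, which is the kind of frequent, fine-grained communication this answer has in mind.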

  • 2020-12-22 18:11

    When the computation and data you are working with behave irregularly, which mostly translates into many messages passed between objects, or when you need low-level hardware access such as RDMA, MPI is the better choice. Some answers here mention task latency or the memory consistency model, but frameworks like Spark and actor-model toolkits like Akka have shown that they can compete with MPI. Finally, one should consider that MPI has the benefit of having been, for years, the main foundation for developing the libraries that scientific computing needs. (These libraries are the most important pieces still missing from the newer frameworks built on DAG/MapReduce models.)

    All in all, I think the benefits that MapReduce/DAG models bring to the table, such as dynamic resource managers and fault-tolerant computation, will make them feasible for scientific computing groups.

  • 2020-12-22 18:18

    Almost any scientific code -- finite differences, finite elements, etc. Which kind of leads to the circular answer, that any distributed program which doesn't easily map to MapReduce would be better implemented with a more general MPI model. Not sure that's much help to you, I'll downvote this answer right after I post it.

  • 2020-12-22 18:23

    The best answer that I could come up with is that MPI is better than MapReduce in two cases:

    1. For short tasks rather than batch processing. For example, MapReduce cannot be used to respond to individual queries - each job is expected to take minutes. I think that in MPI, you can build a query response system where machines send messages to each other to route the query and generate the answer.

    2. For jobs where nodes need to communicate more than iterated MapReduce jobs support, but not so much that the communication overhead makes the computation impractical. I am not sure how often such cases occur in practice, though.
