MapReduce or Spark? [closed]

Submitted by 别等时光非礼了梦想 on 2019-11-29 21:26:21

MapReduce is batch-oriented by nature, so frameworks built on top of MR, such as Hive and Pig, are also batch-oriented. For iterative processing, as in machine learning and interactive analysis, Hadoop/MR doesn't meet the requirement. Here is a nice article from Cloudera, "Why Spark", which summarizes it very nicely.

It's not the end of MR. As of this writing, Hadoop is much more mature than Spark and a lot of vendors support it. That will change over time: Cloudera has started including Spark in CDH, and more and more vendors will include it in their Big Data distributions and provide commercial support for it. We will see MR and Spark in parallel for the foreseeable future.

Also, with Hadoop 2 (aka YARN), MR and other models (including Spark) can run on a single cluster. So Hadoop is not going anywhere.

It depends on what you want to do.

MapReduce's greatest strength is processing lots of large text files. Hadoop's implementation is built around string processing, and it's very I/O heavy.
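To make the model concrete, here is a minimal plain-Python sketch of the MapReduce programming model applied to word counting (this only illustrates the map/shuffle/reduce phases; it is not Hadoop's actual implementation, which distributes these phases across a cluster with heavy disk I/O in between):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this via sort/merge
    # over the network; here we just sort in memory).
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(key, values):
    # Reduce: collapse all values for one key into a single count.
    return key, sum(values)

lines = ["spark or mapreduce", "mapreduce is batch oriented"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs))
```

The three phases map directly onto the Mapper, shuffle/sort, and Reducer stages of a real Hadoop job.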

The problem with MapReduce is that people see the easy parallelism hammer and everything starts to look like a nail. Unfortunately, Hadoop's performance for anything other than processing large text files is terrible. If you write decent parallel code, you can often have it finish before Hadoop even spawns its first VM. I've seen differences of 100x in my own code.

Spark eliminates a lot of Hadoop's overheads, such as the reliance on disk I/O for everything. Instead it keeps intermediate data in memory. That's great if you have enough memory, not so great if you don't.

Remember that Spark is an extension of Hadoop, not a replacement. If you use Hadoop to process logs, Spark probably won't help. If you have more complex, perhaps tightly coupled problems, then Spark would help a lot. Also, you may like Spark's Scala interface for online computations.
