Is caching the only advantage of spark over map-reduce?

轻奢々 2020-12-08 03:22

I have started to learn about Apache Spark and am very impressed by the framework. Although one thing which keeps bothering me is that in all Spark presentations they talk a

5 answers

广开言路 (OP)

2020-12-08 03:36
- Apache Spark processes data in memory, while Hadoop MapReduce writes results back to disk after each map or reduce step. The trade-off is that Spark needs a lot of memory.
- Spark loads data into memory and keeps it there until it is unpersisted or evicted, so that later operations can reuse the cached copy.
- The Resilient Distributed Dataset (RDD) abstraction lets you transparently store data in memory and persist it to disk when needed.
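- Because intermediate results can stay in memory, Spark avoids the disk write and re-read that separates chained MapReduce jobs. This is a major reason for Spark's performance. (Shuffles between stages still act as synchronisation points.)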
- Rather than only processing a batch of stored data, as MapReduce does, Spark can also process data in near real time using Spark Streaming.
- The DataFrames API was inspired by data frames in R and Python (Pandas), but designed from the ground up as an extension to the existing RDD API.
- A DataFrame is a distributed collection of data organized into named columns, with richer optimizations under the hood that contribute to Spark's speed.
- With RDDs, Spark simplifies complex operations like join and groupBy. Behind the scenes you are dealing with partitioned data, and that partitioning is what enables Spark to execute in parallel.
- Spark lets you develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Spark's speed.
- Spark's code base is much smaller.
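
To make the caching and DAG points concrete, here is a rough toy model in plain Python. It is not Spark's API: `LazyDataset`, `read_source`, and `run` are hypothetical names invented for this sketch. It only illustrates the idea that transformations are recorded lazily, and that marking an intermediate result as cached lets two later actions share it instead of re-reading the source (in Spark itself the analogous calls are `cache()` and `persist()` on an RDD or DataFrame).

```python
class LazyDataset:
    """Toy stand-in for an RDD: steps are recorded lazily and run on collect()."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces the rows
        self._want_cache = False   # set by cache(); nothing is computed yet
        self._cached = None        # filled by the first collect() after cache()

    def map(self, fn):
        # Lazy: only records how to derive the new dataset from this one.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._want_cache = True
        return self

    def collect(self):
        # Action: walks the lineage, reusing a cached result if one exists.
        if self._cached is not None:
            return self._cached
        rows = list(self._compute())
        if self._want_cache:
            self._cached = rows
        return rows


reads = {"count": 0}


def read_source():
    reads["count"] += 1            # stands in for an expensive disk scan
    return range(1, 11)


def run(use_cache):
    reads["count"] = 0
    squares = LazyDataset(read_source).map(lambda x: x * x)
    if use_cache:
        squares.cache()
    # Two separate actions that both depend on `squares`.
    evens = squares.filter(lambda x: x % 2 == 0).collect()
    total = sum(squares.collect())
    return evens, total, reads["count"]


if __name__ == "__main__":
    print("without cache:", run(use_cache=False))
    print("with cache:   ", run(use_cache=True))
```

Without the cache the source is scanned once per action (twice here); with it, the second action reuses the stored squares and the source is scanned once. Real Spark adds partitioning, fault recovery from lineage, and spilling to disk, none of which this sketch attempts.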
Hope this helps.