Is caching the only advantage of Spark over MapReduce?

轻奢々 2020-12-08 03:22

I have started to learn about Apache Spark and am very impressed by the framework. One thing that keeps bothering me, though, is that in all Spark presentations they talk about how Spark caches data in memory and how that makes it faster than MapReduce. Is in-memory caching really the only advantage Spark has over MapReduce, or is there more to it?

5 Answers
广开言路 2020-12-08 03:36

    • Apache Spark processes data in memory, while Hadoop MapReduce persists results back to disk after each map or reduce step. The trade-off is that Spark needs a lot of memory.

    • Spark loads data into memory and keeps it there until told otherwise, precisely for the sake of caching.

    • Spark's core abstraction is the Resilient Distributed Dataset (RDD), which lets you transparently store data in memory and persist it to disk when needed (see the caching sketch after this list).

    • Since Spark keeps intermediate data in memory, there is no disk-based synchronisation barrier between jobs slowing you down. This is a major reason for Spark's performance.

    • Rather than only processing batches of stored data, as is the case with MapReduce, Spark can also process data in near real time using Spark Streaming (see the streaming sketch below).

    • The DataFrames API was inspired by data frames in R and Python (pandas), but was designed from the ground up as an extension to the existing RDD API.

    • A DataFrame is a distributed collection of data organized into named columns, with richer optimizations under the hood that contribute to Spark's speed (see the DataFrame sketch below).

    • With RDDs, Spark simplifies complex operations like join and groupBy; under the hood the data is partitioned, and that partitioning is what enables Spark to execute in parallel.

    • Spark lets you develop complex, multi-step data pipelines as a directed acyclic graph (DAG) of operations. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. The DAG scheduler is a major part of Spark's speed (see the pipeline sketch below).

    • Spark's code base is much smaller.
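
    To make the caching point concrete, here is a minimal sketch (my own, not part of the original answer) of RDD persistence in Scala. The input path hdfs:///data/events.log is hypothetical; MEMORY_AND_DISK keeps partitions in memory and spills them to disk only when memory runs out.

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.storage.StorageLevel

        object CachingSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("caching-sketch")
              .master("local[*]") // local mode, just for illustration
              .getOrCreate()
            val sc = spark.sparkContext

            // Hypothetical input path -- replace with real data.
            val lines = sc.textFile("hdfs:///data/events.log")

            // Keep the filtered RDD in memory, spilling to disk if needed.
            val errors = lines.filter(_.contains("ERROR"))
              .persist(StorageLevel.MEMORY_AND_DISK)

            // Both actions below reuse the cached RDD; separate MapReduce
            // jobs would each re-read the input from disk.
            println(errors.count())
            println(errors.filter(_.contains("timeout")).count())

            errors.unpersist() // release the cache when done
            spark.stop()
          }
        }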
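
    For the streaming bullet, a minimal DStream word count, assuming a text source on localhost:9999 (for example `nc -lk 9999`) and the spark-streaming dependency on the classpath:

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        object StreamingSketch {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
            val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

            // Count words arriving over a socket in near real time --
            // something a stored-batch MapReduce job cannot do.
            ssc.socketTextStream("localhost", 9999)
              .flatMap(_.split(" "))
              .map((_, 1))
              .reduceByKey(_ + _)
              .print()

            ssc.start()
            ssc.awaitTermination()
          }
        }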
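
    For the DataFrame bullets, a small sketch of the named-column API with made-up data. A query written this way goes through Spark's optimizer before executing, rather than running verbatim like hand-written map/reduce code:

        import org.apache.spark.sql.SparkSession

        object DataFrameSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("dataframe-sketch")
              .master("local[*]")
              .getOrCreate()
            import spark.implicits._

            // A distributed collection of rows organized into named columns.
            val people = Seq(("alice", 34), ("bob", 29), ("carol", 41))
              .toDF("name", "age")

            // Declarative query; Spark optimizes the plan before running it.
            people.filter($"age" > 30)
              .groupBy($"age")
              .count()
              .show()

            spark.stop()
          }
        }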
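
    Finally, a sketch of a multi-step pipeline for the partitioning and DAG bullets. Each transformation only records a step in the DAG; nothing executes until the action at the end, and the work is split across partitions:

        import org.apache.spark.sql.SparkSession

        object DagPipelineSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("dag-sketch")
              .master("local[*]")
              .getOrCreate()
            val sc = spark.sparkContext

            val counts = sc.parallelize(Seq("spark is fast", "spark caches data"), 4)
              .flatMap(_.split(" "))  // step 1: lazily recorded in the DAG
              .map(word => (word, 1)) // step 2
              .reduceByKey(_ + _)     // step 3: shuffles across partitions

            println(s"partitions: ${counts.getNumPartitions}")
            counts.collect().foreach(println) // the action triggers the whole DAG

            spark.stop()
          }
        }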

    Hope this helps.
