What is RDD in Spark?

傲寒 2020-12-12 19:20

Definition says:

RDD is an immutable distributed collection of objects

I don't quite understand what it means. Is it like da…

9 Answers
  •  醉酒成梦
    2020-12-12 20:03

    RDD = Resilient Distributed Dataset

    Resilient (Dictionary meaning) = (of a substance or object) able to recoil or spring back into shape after bending, stretching, or being compressed

    RDD is defined in Learning Spark (O'Reilly) as follows: "The ability to always recompute an RDD is actually why RDDs are called 'resilient.' When a machine holding RDD data fails, Spark uses this ability to recompute the missing partitions, transparent to the user."
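
    Every RDD remembers the chain of transformations that produced it (its lineage); that recorded recipe is what makes recomputation possible. Below is a minimal Scala sketch of the idea, assuming a local SparkContext (the app name and variable names are made up for illustration):

        import org.apache.spark.{SparkConf, SparkContext}

        // Illustrative local setup; on a real cluster the master URL differs.
        val sc = new SparkContext(
          new SparkConf().setAppName("LineageDemo").setMaster("local[*]"))

        // Each transformation records a step in the RDD's lineage instead of
        // materializing data; Spark replays these steps to rebuild any
        // partition that is lost.
        val numbers = sc.parallelize(1 to 1000000, 8)
        val squares = numbers.map(n => n.toLong * n)
        val evens   = squares.filter(_ % 2 == 0)

        // toDebugString prints the lineage Spark would use for recomputation.
        println(evens.toDebugString)

    Nothing is computed until an action runs; the lineage is the fallback recipe Spark replays after a failure.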

    This means the data is effectively available at all times. Note that Spark can run without Hadoop, in which case the data is NOT replicated; resilience comes from recomputation rather than redundant copies. One of the best characteristics of Hadoop 2.0 is high availability, achieved with a passive standby NameNode; RDDs give Spark a comparable guarantee.

    A given RDD (its data) can span multiple nodes in a Spark cluster, much like data in a Hadoop-based cluster.
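
    To make "span across nodes" concrete, here is a small sketch (reusing the sc from the previous snippet) showing that an RDD is split into partitions, each of which Spark can schedule on a different node:

        val data = sc.parallelize(1 to 100, 4)
        println(data.getNumPartitions) // 4 partitions, potentially 4 nodes

        // Peek at which elements landed in which partition.
        data.mapPartitionsWithIndex { (idx, it) =>
          Iterator(s"partition $idx holds ${it.size} elements")
        }.collect().foreach(println)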

    If any node crashes, Spark can recompute the lost partitions of the RDD and load that data on another node, so the data is always available. As the official programming guide puts it: "Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel." (http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds)
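
    And an end-to-end sketch of "operated on in parallel" (again reusing sc): the transformations below run per partition across the cluster, and if an executor died mid-job, Spark would recompute the affected partitions from lineage and the action would still complete:

        val words  = sc.parallelize(Seq("spark", "rdd", "spark", "resilient"))
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
        counts.collect().foreach { case (w, c) => println(s"$w -> $c") }

        sc.stop()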
