What is an RDD in Spark?


傲寒

Definition says:

An RDD is an immutable, distributed collection of objects

I don't quite understand what this means. Is it like da


9 answers

野趣味

2020-12-12 20:00

An RDD is a way of representing data in Spark. The data can come from JSON, CSV, plain text files, or some other source. An RDD is fault tolerant: its data is split into partitions spread across the nodes of the cluster, and Spark records how each partition was derived, so if a node fails the lost partitions can be recovered by recomputing them. In that sense the data in an RDD stays available. However, RDDs are a low-level API; they are often slower and harder to code against than the higher-level APIs, so for most structured-data work the DataFrame and Dataset APIs, which are built on top of RDDs, are now preferred.

醉酒成梦

2020-12-12 20:03

RDD = Resilient Distributed Dataset

Resilient (Dictionary meaning) = (of a substance or object) able to recoil or spring back into shape after bending, stretching, or being compressed

Learning Spark (O'Reilly) explains the name as follows: “The ability to always recompute an RDD is actually why RDDs are called “resilient.” When a machine holding RDD data fails, Spark uses this ability to recompute the missing partitions, transparent to the user.”

This means the data can always be made available again. Spark can also run without Hadoop, and RDD data is NOT replicated by default. One of the best characteristics of Hadoop 2.0 is 'High Availability', achieved with the help of a passive standby NameNode. RDDs give Spark a comparable resilience for its data, but through recomputation rather than a standby copy.

A given RDD (the data) can span multiple nodes in a Spark cluster (as in a Hadoop-based cluster).
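To make the idea of data spanning several nodes concrete, here is a minimal sketch in plain Python. It is not Spark code: `split_into_partitions` and `map_partitions` are hypothetical helper names, and worker threads stand in for cluster nodes. It only illustrates that a dataset is cut into partitions and that each partition can be processed independently.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal contiguous chunks (toy 'partitions')."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions get one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def map_partitions(partitions, func):
    """Apply func to every element, using one worker thread per partition."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda part: [func(x) for x in part], partitions))


# Assumption for illustration: 10 numbers spread over 3 "nodes".
partitions = split_into_partitions(list(range(10)), 3)
squared = map_partitions(partitions, lambda x: x * x)
```

In a real cluster the partitions live on different machines and Spark schedules the work; the threads here only mimic that independence.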

If any node crashes, Spark can recompute the lost part of the RDD and load the data onto some other node, so the data is always available. In the words of the programming guide, Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel (http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds)
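The recompute-on-failure idea can be sketched with a toy model in plain Python. This is an illustration under stated assumptions, not Spark's implementation: `ToyRDD`, `from_list`, and `lose_partition` are hypothetical names, and "losing a node" is simulated by dropping a cached partition. Each dataset keeps a function describing how to rebuild any of its partitions, which plays the role of the lineage.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: remembers how to rebuild each partition."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute  # lineage: partition index -> list of elements
        self._cache = {}         # partitions currently "held by a node"

    @classmethod
    def from_list(cls, data, num_partitions):
        source = tuple(data)  # stands in for durable source data
        # Partition i holds every num_partitions-th element, starting at index i.
        return cls(num_partitions, lambda i: list(source[i::num_partitions]))

    def map(self, func):
        parent = self
        # The child only records how to derive its partitions from the parent.
        return ToyRDD(self.num_partitions,
                      lambda i: [func(x) for x in parent.partition(i)])

    def partition(self, i):
        if i not in self._cache:
            # Partition missing (never built, or lost): rebuild it from lineage.
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure by dropping one materialized partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


numbers = ToyRDD.from_list(range(6), 2)
doubled = numbers.map(lambda x: x * 2)
before = doubled.collect()

# Simulate a crash that wipes partition 1 of both datasets.
doubled.lose_partition(1)
numbers.lose_partition(1)
after = doubled.collect()  # partition 1 is rebuilt from the source data
```

Here `before` and `after` are equal even though no copy of the lost partition was kept, which is the sense in which the data is "always available".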

抹茶落季

2020-12-12 20:04
To compare an RDD with a Scala collection, here are a few differences:
1. The API is similar, but an RDD runs on a cluster
2. An RDD is lazy in nature, whereas Scala collections are strict by default
3. An RDD is always immutable, i.e., you cannot change the state of the data in the collection
4. RDDs are self-recovering, i.e., fault-tolerant
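Points 2 and 3 above can be illustrated with a small toy in plain Python. This is only a sketch, not Spark's API: `LazyDataset` is a hypothetical class, and `add_one` records its calls so you can see when the work actually happens. Transformations are merely recorded, each one returns a new object, and nothing runs until `collect()` is called.

```python
class LazyDataset:
    """Hypothetical toy: records transformations, runs them only on collect()."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # immutable snapshot of the input
        self._steps = tuple(steps)    # pending transformations, in order

    def map(self, func):
        # Returns a NEW dataset; the receiver is left unchanged (immutability).
        return LazyDataset(self._source, self._steps + (func,))

    def collect(self):
        # The "action": only now are the recorded functions applied.
        result = list(self._source)
        for func in self._steps:
            result = [func(x) for x in result]
        return result


calls = []


def add_one(x):
    calls.append(x)  # side effect that reveals when the function really runs
    return x + 1


base = LazyDataset([1, 2, 3])
plus_one = base.map(add_one)
calls_before_action = len(calls)  # the map has only been recorded so far
values = plus_one.collect()       # add_one runs here, once per element
unchanged = base.collect()        # the original dataset still has its old values
```

A strict collection would have called `add_one` as soon as `map` was invoked; here the calls happen only inside `collect()`, and `base` is never modified.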