What is an RDD in Spark?

傲寒 2020-12-12 19:20

The definition says:

An RDD is an immutable distributed collection of objects

I don't quite understand what it means. Is it like da

9 Answers
  • 2020-12-12 20:00

    An RDD is Spark's basic way of representing data. The source of the data can be a JSON file, a CSV or text file, or some other source. An RDD is fault tolerant: its data is split into partitions spread across the nodes of the cluster, and if a node fails, Spark can recompute the lost partitions from the operations that produced them. However, RDDs are a low-level API and are usually slower and harder to code against than the newer DataFrame and Dataset APIs, which are built on top of RDDs and are now preferred for most work.

  • 2020-12-12 20:03

    RDD = Resilient Distributed Dataset

    Resilient (Dictionary meaning) = (of a substance or object) able to recoil or spring back into shape after bending, stretching, or being compressed

    Learning Spark (O'Reilly) explains the name: The ability to always recompute an RDD is actually why RDDs are called “resilient.” When a machine holding RDD data fails, Spark uses this ability to recompute the missing partitions, transparent to the user.

    This means the data can always be made available again. Spark can also run without Hadoop, so it does not depend on HDFS replication for this: by default an RDD's data is NOT replicated. One of the best characteristics of Hadoop 2.0 is "High Availability", provided by a passive standby NameNode. RDDs give Spark a comparable guarantee for its data by recomputing whatever is lost.

    A given RDD (data) can span multiple nodes in a Spark cluster (as in a Hadoop-based cluster).

    If any node crashes, Spark can recompute the lost part of the RDD and load it on another node, so the data stays available. As the programming guide puts it, Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel (http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds).
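
    The recompute-on-failure idea can be sketched without a cluster. The following is a plain-Python toy model, not Spark code: `ToyRDD`, `from_list`, `lose_partition`, and the round-robin partitioning are all hypothetical names and choices made for illustration. Each dataset remembers the recipe (its lineage) for building every partition, so a lost partition can be rebuilt on demand.

```python
# Toy model only: ToyRDD and its methods are hypothetical, not Spark's API.
from typing import Callable, Dict, List


class ToyRDD:
    """A dataset split into partitions that remembers how to rebuild each one."""

    def __init__(self, num_partitions: int,
                 compute: Callable[[int], List[int]]) -> None:
        self.num_partitions = num_partitions
        self._compute = compute                  # lineage: recipe for partition i
        self._cache: Dict[int, List[int]] = {}   # partitions currently materialized

    @classmethod
    def from_list(cls, data: List[int], num_partitions: int) -> "ToyRDD":
        # Slice the source data round-robin into partitions.
        return cls(num_partitions, lambda i: data[i::num_partitions])

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # The child only records its parent and the function to apply.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)])

    def partition(self, i: int) -> List[int]:
        # Return the materialized copy, or rebuild it from lineage if missing.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate the node holding partition i of this dataset crashing.
        self._cache.pop(i, None)

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = ToyRDD.from_list(list(range(6)), num_partitions=3)
squares = source.map(lambda x: x * x)

before = squares.collect()   # [0, 9, 1, 16, 4, 25]
squares.lose_partition(1)    # "node failure": partition 1 is gone
after = squares.collect()    # partition 1 is rebuilt from its lineage
print(before == after)       # True
```

    Nothing is copied to a second location here; the lost partition comes back only because the recipe for it was kept. Real Spark tracks lineage per RDD across a cluster and is considerably more involved than this sketch.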

  • 2020-12-12 20:04

    Comparing an RDD with a Scala collection, here are a few differences:

    1. The programming style is the same, but an RDD runs on a cluster
    2. An RDD is lazy, whereas Scala collections are strict by default
    3. An RDD is always immutable, i.e., you cannot change the state of the data in the collection
    4. An RDD is self-recovering, i.e., fault tolerant
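
    Points 2 and 3 can be illustrated with a small plain-Python sketch. This is a toy, not Spark or Scala code: `LazySeq` and `traced_double` are hypothetical names introduced here. A strict collection applies a function as soon as it is mapped, while the lazy one only records the step and runs it when an action (`collect`) is called; each transformation returns a new object and leaves the original untouched.

```python
# Toy model only: LazySeq is a hypothetical class, not part of Spark.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)   # frozen: fields cannot be reassigned after construction
class LazySeq:
    source: Tuple[int, ...]
    steps: Tuple[Callable[[int], int], ...] = ()

    def map(self, fn: Callable[[int], int]) -> "LazySeq":
        # Transformation: return a NEW object with one more recorded step.
        return LazySeq(self.source, self.steps + (fn,))

    def collect(self) -> List[int]:
        # Action: only now are the recorded steps actually applied.
        out = []
        for x in self.source:
            for fn in self.steps:
                x = fn(x)
            out.append(x)
        return out


calls: List[int] = []


def traced_double(x: int) -> int:
    calls.append(x)   # record every time the function really runs
    return 2 * x


strict = [traced_double(x) for x in (1, 2, 3)]   # strict: runs immediately
calls_after_strict = len(calls)                  # 3

base = LazySeq((1, 2, 3))
doubled = base.map(traced_double)                # lazy: nothing runs yet
calls_after_map = len(calls)                     # still 3

result = doubled.collect()                       # the action triggers the work
calls_after_collect = len(calls)                 # 6
```

    After `map`, `base` still has no recorded steps, which is the sense in which the collection is immutable: a transformation describes a new dataset rather than changing the existing one.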