What is an RDD in Spark?

傲寒 2020-12-12 19:20

Definition says:

An RDD is an immutable distributed collection of objects

I don't quite understand what it means. Is it like da

9 answers
  •  一向 (original poster)
    2020-12-12 19:49

    Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.

    RDDs have the following properties –

    1. Immutability and partitioning: An RDD is composed of a collection of records that are partitioned. A partition is the basic unit of parallelism in an RDD. Each partition is one logical division of the data; it is immutable and is created through transformations on existing partitions. Immutability helps keep computations consistent.

      Users can define their own partitioning criteria, based on the keys on which they want to join multiple datasets, if needed.
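      To make key-based partitioning concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: `partition_by_key` and `join_partition` are hypothetical helpers, and the sample data is made up. In Spark itself, pair RDDs offer `partitionBy` for this purpose.

```python
def partition_by_key(records, num_partitions):
    """Place each (key, value) record in the partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def join_partition(left, right):
    """Join two partitions that were built with the same partitioning rule."""
    lookup = {}
    for key, value in left:
        lookup.setdefault(key, []).append(value)
    return [(key, (l, r)) for key, r in right for l in lookup.get(key, [])]


# Hypothetical sample data keyed by an integer user id.
users = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")]
orders = [(1, "book"), (3, "pen"), (3, "lamp"), (4, "desk")]

user_parts = partition_by_key(users, 2)
order_parts = partition_by_key(orders, 2)

# Because both datasets use the same rule, matching keys share a partition
# index, so each pair of partitions can be joined independently.
joined = [
    pair
    for left, right in zip(user_parts, order_parts)
    for pair in join_partition(left, right)
]
```

      The point of the sketch is that partitioning both datasets the same way lets a join work partition by partition, without comparing every partition against every other.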

    2. Coarse-grained operations: Coarse-grained operations are applied to all elements in a dataset. For example, a map, filter, or groupBy operation is performed on all elements in a partition of an RDD.

    3. Fault tolerance: Since RDDs are created through a set of transformations, Spark logs those transformations rather than the actual data. The graph of transformations that produces an RDD is called its lineage graph.

    For example –

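    # Each transformation derives a new RDD from the previous one.
    firstRDD = sc.textFile("hdfs://...")
    secondRDD = firstRDD.filter(someFunction)
    thirdRDD = secondRDD.map(someFunction)

    # count() is an action that returns the number of records.
    result = thirdRDD.count()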
    

    If we lose a partition of an RDD, we can replay the transformations in its lineage on that partition to recompute the same data, rather than replicating data across multiple nodes. This is the biggest benefit of RDDs: it saves a lot of effort in data management and replication, and thus achieves faster computations.
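    The recovery idea can be sketched in plain Python. This is a toy model under stated assumptions, not how Spark is implemented: `ToyRDD` and `STABLE_STORAGE` are hypothetical names, and "stable storage" is just an in-memory list of blocks. Each dataset keeps only a function that can rebuild any one of its partitions from its parent.

```python
# Hypothetical stand-in for data in stable storage: one list of lines per block.
STABLE_STORAGE = [
    ["spark", "", "rdd"],
    ["lineage", "graph", ""],
]


class ToyRDD:
    """Stores how to compute each partition (its lineage), not the data itself."""

    def __init__(self, num_partitions, compute_partition):
        self.num_partitions = num_partitions
        self.compute_partition = compute_partition  # partition index -> records

    @classmethod
    def from_storage(cls, blocks):
        return cls(len(blocks), lambda i: list(blocks[i]))

    def filter(self, predicate):
        parent = self
        return ToyRDD(
            parent.num_partitions,
            lambda i: [x for x in parent.compute_partition(i) if predicate(x)],
        )

    def map(self, fn):
        parent = self
        return ToyRDD(
            parent.num_partitions,
            lambda i: [fn(x) for x in parent.compute_partition(i)],
        )


first = ToyRDD.from_storage(STABLE_STORAGE)
second = first.filter(lambda line: line != "")
third = second.map(len)

# Materialize every partition, as if the results were held on worker nodes.
materialized = {i: third.compute_partition(i) for i in range(third.num_partitions)}

# Simulate losing partition 1, then rebuild only that partition from lineage.
del materialized[1]
materialized[1] = third.compute_partition(1)
```

    Only the lost partition is recomputed; partition 0 is untouched, and no second copy of the data was ever kept.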

    4. Lazy evaluation: Spark computes RDDs lazily, the first time they are used in an action, so that it can pipeline transformations. In the example above, the RDDs are evaluated only when the count() action is invoked.

    5. Persistence: Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory or on-disk storage).
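    Lazy evaluation and persistence can be illustrated with another small plain-Python sketch. Again this is a toy model, not Spark's API: `LazyDataset`, `read_source`, and the `reads` counter are hypothetical, and the only storage strategy modeled is keeping the result in memory.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source        # callable, so reading is deferred too
        self._steps = tuple(steps)   # recorded ("map" | "filter", function) pairs
        self._persist = False
        self._cache = None

    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def persist(self):
        """Mark this dataset so its result is kept after the first action."""
        self._persist = True
        return self

    def _evaluate(self):
        if self._cache is not None:
            return self._cache
        records = iter(self._source())
        # Chain iterators so each record flows through all steps in one pass.
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        result = list(records)
        if self._persist:
            self._cache = result
        return result

    def count(self):
        """An action: triggers evaluation and returns the number of records."""
        return len(self._evaluate())


reads = []  # one entry per read of the source


def read_source():
    reads.append(1)
    return ["a", "", "bc", "def"]


lengths = LazyDataset(read_source).filter(lambda s: s != "").map(len)
reads_before_action = len(reads)  # transformations alone read nothing

first_count = lengths.count()     # first action: reads the source
lengths.count()                   # not persisted: reads the source again

lengths.persist()
lengths.count()                   # reads the source once more and caches
lengths.count()                   # served from the cache, no new read
```

    In this sketch the source is read three times across four `count()` calls: twice before `persist()` and once after, since the last call is answered from the cached result.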

    These properties of RDDs make them useful for fast computations.
