What is RDD in Spark

傲寒 2020-12-12 19:20

Definition says:

RDD is immutable distributed collection of objects

I don't quite understand what it means. Is it like da

9 answers
  •  情歌与酒
    2020-12-12 19:56

    RDDs (Resilient Distributed Datasets) are an abstraction for representing data. Formally, an RDD is a read-only, partitioned collection of records that provides a convenient API.
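    For concreteness, here is a minimal sketch (Scala, assuming a SparkContext called sc is already available, as it is in spark-shell) of building and using an RDD:

        // A minimal sketch -- assumes a SparkContext `sc` is already available,
        // as it is in spark-shell.
        val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 4)  // spread over 4 partitions
        val doubled = numbers.map(_ * 2)   // transformation: returns a new RDD
        doubled.collect()                  // action: Array(2, 4, 6, 8, 10)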

    RDDs provide a performant solution for processing large datasets on a cluster, addressing some key shortcomings of frameworks such as MapReduce (a short code sketch follows this list):

    • data is kept in memory to reduce disk I/O; this is particularly relevant for iterative computations, which no longer have to persist intermediate results to disk
    • fault tolerance (resilience) is achieved not by replicating data but by keeping track of the transformations applied to the initial dataset (the lineage). In case of failure, lost data can be recomputed from its lineage, and avoiding replication also reduces storage overhead
    • lazy evaluation: computations are only carried out when their results are actually needed
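    A rough sketch of how these three points show up in practice (the file path and the log field layout are made up for illustration):

        // Illustrative sketch of the points above; `sc` and the log path are assumptions.
        val lines  = sc.textFile("hdfs:///logs/app.log")        // nothing is read yet (lazy)
        val errors = lines.filter(_.contains("ERROR"))          // lineage so far: textFile -> filter
        errors.cache()                                          // keep the filtered data in memory

        val total  = errors.count()                             // first action: triggers the actual work
        val byHost = errors                                     // second job reuses the cached partitions;
          .map(line => (line.split(" ")(0), 1))                 // if a partition is lost, it is recomputed
          .reduceByKey(_ + _)                                   // from the lineage rather than restored
          .collect()                                            // from a replica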

    RDDs have two main limitations (sketched in code below):

    • they're immutable (read-only)
    • they only allow coarse-grained transformations (i.e. operations that apply to the entire dataset)
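    A tiny sketch of what these limitations look like in code (again assuming sc from spark-shell):

        // Transformations never modify an RDD in place, and they are coarse-grained:
        // they apply to every record, not to individual ones.
        val rdd     = sc.parallelize(1 to 10)
        val shifted = rdd.map(_ + 1)   // a brand-new RDD; `rdd` itself is unchanged
        // There is no fine-grained write such as a hypothetical rdd.update(i, value);
        // that is exactly what "coarse-grained transformations only" means.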

    One nice conceptual advantage of RDDs is that they pack together data and code, making it easier to reuse data pipelines.

    Sources: "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"; "An Architecture for Fast and General Data Processing on Large Clusters"
