What is RDD in Spark?

傲寒 2020-12-12 19:20

Definition says:

RDD is immutable distributed collection of objects

I don't quite understand what that means. Is it like da…

9 Answers
  •  渐次进展
    2020-12-12 19:43

    Resilient Distributed Dataset (RDD) is the way Spark represents data. The data can come from various sources (a short creation sketch in Scala follows this list):

    • Text File
    • CSV File
    • JSON File
    • Database (via a JDBC driver)
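
    As a minimal sketch of how each source becomes an RDD (the paths, connection URL, and table name below are hypothetical; JSON and JDBC sources are usually read through the DataFrame API first and then converted):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("rdd-sources")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Text file: one RDD element per line
        val lines = sc.textFile("data/input.txt")

        // CSV file: read as text, then split each line into fields
        val rows = sc.textFile("data/input.csv").map(_.split(","))

        // JSON file: read as a DataFrame, then drop down to an RDD[Row]
        val jsonRdd = spark.read.json("data/input.json").rdd

        // Database via JDBC: read as a DataFrame, then convert
        val dbRdd = spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://localhost/mydb") // hypothetical
          .option("dbtable", "my_table")                     // hypothetical
          .load()
          .rdd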

    RDD in relation to Spark

    Spark is, at its core, an implementation of the RDD abstraction.

    RDD in relation to Hadoop

    The power of Hadoop resides in the fact that it lets users write parallel computations without having to worry about work distribution and fault tolerance. However, Hadoop is inefficient for applications that reuse intermediate results. For example, iterative machine-learning algorithms such as PageRank, K-means clustering, and logistic regression reuse intermediate results between steps.

    RDDs allow intermediate results to be stored in RAM. Hadoop would have to write them to an external stable storage system, which generates disk I/O and serialization overhead. With RDDs, Spark can be up to 20x faster than Hadoop for iterative applications.
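
    A minimal sketch of this, assuming the SparkContext sc from the earlier snippet (the input path is hypothetical): cache() keeps the parsed RDD in RAM so repeated actions reuse it instead of re-reading stable storage.

        // Parse once, keep the result in memory across actions
        val data = sc.textFile("data/values.txt")
          .map(_.toDouble)
          .cache()

        // Without cache(), each of these actions would re-read and
        // re-parse the file from disk; with it, they hit RAM.
        val total = data.sum()
        val biggest = data.max()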

    Further implementation details about Spark

    Coarse-Grained transformations

    The transformations applied to an RDD are coarse-grained. This means that an operation on an RDD applies to the whole dataset, not to its individual elements. Therefore, operations like map, filter, groupBy, and reduce are allowed, but element-level operations like set(i) and get(i) are not.
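
    A minimal sketch of the contrast, again assuming the SparkContext sc from above: every operation below describes a transformation of the entire dataset, and there is no per-element setter or getter.

        val nums = sc.parallelize(1 to 10)

        // Coarse-grained: each call applies to the whole dataset
        val evens   = nums.filter(_ % 2 == 0)
        val squares = evens.map(n => n * n)
        val total   = squares.reduce(_ + _)

        // There is no nums.set(3, 42) or nums.get(3). Because RDDs are
        // immutable, an "update" means deriving a new RDD from the old one:
        val updated = nums.map(n => if (n == 3) 42 else n)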

    The opposite of coarse-grained is fine-grained. A database, for example, is a fine-grained storage system, since it reads and writes individual records.

    Fault Tolerant

    RDDs are fault tolerant: fault tolerance is the property that enables a system to continue working properly when one of its components fails.

    Spark's fault tolerance is strongly linked to its coarse-grained nature. The only way to implement fault tolerance in a fine-grained storage system is to replicate its data or log updates across machines. In a coarse-grained system like Spark, however, only the transformations are logged. If a partition of an RDD is lost, the RDD has enough lineage information to recompute it quickly.
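
    A minimal sketch of what gets logged (the input path is hypothetical, and sc is the SparkContext from above): toDebugString prints the chain of coarse-grained transformations, which is exactly what Spark replays to rebuild a lost partition.

        val counts = sc.textFile("data/input.txt")   // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // The lineage below is all Spark needs to log: a lost partition
        // is rebuilt by re-running these steps on its inputs.
        println(counts.toDebugString)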

    Data storage

    An RDD is "distributed" (split) into partitions. Each partition can live in the memory or on the disk of a machine in the cluster. When Spark wants to launch a task on a partition, it sends the task to a machine holding that partition. This is known as "locality-aware scheduling".
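
    A minimal sketch of partitioning, with the same assumed sc (the path and partition count are hypothetical):

        // Ask for at least 8 partitions; each partition becomes one task
        val rdd = sc.textFile("data/input.txt", minPartitions = 8)

        println(rdd.getNumPartitions)

        // Each task processes one partition, scheduled preferentially on
        // a machine that already holds that partition's data
        val sizes = rdd.mapPartitions(it => Iterator(it.size)).collect()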

    Sources: the research papers about Spark listed at http://spark.apache.org/research.html, including the paper suggested by Ewan Leith.
