What is the efficient way to update value inside Spark's RDD?

Asked by 别那么骄傲 on 2020-12-29 12:35 · 3 answers · 914 views

I'm writing a graph-related program in Scala with Spark. The dataset has 4 million nodes and 4 million edges (you can treat this as a tree), but f

3 Answers
  •  孤城傲影
    2020-12-29 12:54

    As functional data structures, RDDs are immutable and an operation on an RDD generates a new RDD.

    Immutability does not necessarily mean full replication. Persistent data structures are a common functional pattern: operations on an immutable structure yield a new structure, while previous versions are preserved and large unchanged parts are reused.
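    To make the point concrete, here is a minimal sketch (names like `nodes` and `bump` are hypothetical, not from the question) of what "updating" a value in an RDD looks like in practice: you derive a new RDD with a transformation such as `map`, and the original RDD remains untouched.

    ```scala
    import org.apache.spark.rdd.RDD

    // Pure update logic, kept separate from the Spark plumbing so it is easy to test.
    // Increments the value of the node whose id matches `target`; leaves others as-is.
    def bump(id: Long, value: Int, target: Long): Int =
      if (id == target) value + 1 else value

    // Assuming an existing SparkContext `sc` and an RDD of (nodeId, value) pairs:
    // val nodes: RDD[(Long, Int)] = sc.parallelize(Seq((1L, 10), (2L, 20)))

    // `map` does not mutate `nodes`; it lazily describes a NEW RDD.
    // val updated: RDD[(Long, Int)] =
    //   nodes.map { case (id, v) => (id, bump(id, v, target = 1L)) }
    ```

    The transformation is lazy, so no data moves until an action (e.g. `count`, `collect`) runs, and `nodes` stays valid for other computations.
    
    
    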

    GraphX, a graph API built on top of Spark, uses exactly this concept. From the docs:

    Changes to the values or structure of the graph are accomplished by producing a new graph with the desired changes. Note that substantial parts of the original graph (i.e., unaffected structure, attributes, and indices) are reused in the new graph, reducing the cost of this inherently functional data structure.

    It might be a solution for the problem at hand: http://spark.apache.org/docs/1.0.0/graphx-programming-guide.html
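    A hedged sketch of what that looks like with GraphX (the graph itself and the update function are assumptions for illustration): `mapVertices` transforms only the vertex attributes, and the resulting graph shares the unchanged edge structure and indices with the original instead of copying them.

    ```scala
    import org.apache.spark.graphx.{Graph, VertexId}

    // Pure attribute-update function: doubles every vertex's value.
    def doubleVertex(id: VertexId, attr: Int): Int = attr * 2

    // Assuming `graph: Graph[Int, Int]` was built elsewhere,
    // e.g. Graph(vertexRDD, edgeRDD):
    // val updatedGraph: Graph[Int, Int] = graph.mapVertices(doubleVertex)
    //
    // `updatedGraph` is a new Graph; `graph` is unchanged, and the edge
    // structure/indices are reused rather than replicated.
    ```

    For the 4-million-node tree in the question, this reuse is what keeps per-update cost proportional to what actually changed rather than to the whole graph.
    
    
    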
