What is the efficient way to update value inside Spark's RDD?

别那么骄傲 2020-12-29 12:35

I'm writing a graph-related program in Scala with Spark. The dataset has 4 million nodes and 4 million edges (you can treat this as a tree), but f

3 Answers
  •  暖寄归人
    2020-12-29 13:14

    An RDD is a distributed data set; a partition is the unit of RDD storage, and an element is the unit of processing within an RDD.

    For example, when you read a large file from HDFS as an RDD, the elements of that RDD are Strings (the lines of the file), and Spark stores the RDD across the cluster in partitions. As a Spark user, you only need to care about how to deal with the lines of the file, just as if you were writing a normal program that reads a file from the local file system line by line. That's the power of Spark :)
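    A minimal sketch of that read-and-transform pattern (the local master, app name, and the in-memory `parallelize` stand-in for `sc.textFile("hdfs://...")` are assumptions for illustration, not part of the original answer):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    // Assumed setup: local mode so the sketch is self-contained; a real job
    // would be submitted to a cluster and read from HDFS with sc.textFile.
    val sc = new SparkContext(new SparkConf().setAppName("lines-sketch").setMaster("local[2]"))

    // parallelize stands in for sc.textFile("hdfs://..."); each String is one element.
    val lines = sc.parallelize(Seq("first line", "second line"))

    // You work element by element, as if reading a local file line by line;
    // Spark decides how the elements are split into partitions.
    val lengths = lines.map(_.length)
    val result = lengths.collect().sorted
    sc.stop()
    ```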

    In any case, you have no idea which elements will be stored in a given partition, so it doesn't make sense to try to update a particular partition directly.
