I'm writing a graph-related program in Scala with Spark. The dataset has 4 million nodes and 4 million edges (you can treat this as a tree), but f
An RDD is a distributed dataset. A partition is the unit of storage for an RDD, and an element is the unit of processing.
For example, if you read a large file from HDFS as an RDD, each element of that RDD is a String (one line of the file), and Spark stores the RDD across the cluster partition by partition. As a Spark user, you only need to care about how to process the lines of that file, just as if you were writing a normal program that reads a file from the local file system line by line. That's the power of Spark :)
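As a rough sketch of what "only care about the lines" looks like in practice: the per-line logic below is plain Scala and knows nothing about partitions. The HDFS path, the whitespace-separated `src dst` line format, and the names `EdgeLines` and `parseEdge` are assumptions made up for illustration, not something from the question.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object EdgeLines {
  // Per-element logic: ordinary Scala, unaware of partitions or the cluster.
  // Assumes (hypothetically) that each line looks like "src dst".
  def parseEdge(line: String): Option[(Long, Long)] =
    line.trim.split("\\s+") match {
      case Array(src, dst) => Try((src.toLong, dst.toLong)).toOption
      case _               => None // skip blank or malformed lines
    }

  def main(args: Array[String]): Unit = {
    // Local master for illustration only; on a cluster it is usually set by spark-submit.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("edge-lines")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical path; each element of this RDD is one line of the file.
    val lines = sc.textFile("hdfs:///data/edges.txt")

    // Spark applies parseEdge to every element, wherever its partition lives.
    val edges = lines.flatMap(line => parseEdge(line).toList)

    println(s"parsed ${edges.count()} edges")
    spark.stop()
  }
}
```

Notice that nothing in `parseEdge` mentions partitions; Spark decides where each line is processed.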
Anyway, you generally have no idea which elements will be stored in which partition, so it doesn't make sense to update a particular partition.
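If you are curious where elements ended up, you can peek with `mapPartitionsWithIndex`, but treat that as a debugging aid rather than something to build logic on. The sketch below also shows the usual alternative to "updating": RDDs are immutable, so a change is expressed as a transformation that yields a new RDD. The object name, the sample data, and the `update` rule are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object PartitionPeek {
  // Hypothetical per-element "update" rule: multiply even values by 10.
  def update(n: Int): Int = if (n % 2 == 0) n * 10 else n

  def main(args: Array[String]): Unit = {
    // Local master for illustration only.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("partition-peek")
      .getOrCreate()
    val sc = spark.sparkContext

    // Small stand-in dataset, split into 4 partitions.
    val nodes = sc.parallelize(1 to 20, numSlices = 4)

    // Debugging aid: tag each element with the index of the partition holding it.
    val placement = nodes.mapPartitionsWithIndex { (idx, elems) =>
      elems.map(e => (idx, e))
    }
    placement.collect().foreach { case (idx, e) =>
      println(s"partition $idx holds $e")
    }

    // Element-wise transformation: produces a new RDD, leaves `nodes` untouched.
    val updated = nodes.map(update)
    println(updated.collect().mkString(", "))

    spark.stop()
  }
}
```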