What is the efficient way to update value inside Spark's RDD?

别那么骄傲 2020-12-29 12:35

I'm writing a graph-related program in Scala with Spark. The dataset has 4 million nodes and 4 million edges (you can treat this as a tree), but f

3 Answers
  •  暖寄归人
    2020-12-29 13:14

    An RDD is a distributed data set; a partition is the unit of RDD storage, and an element is the unit of processing within an RDD.

    For example, when you read a large file from HDFS as an RDD, the elements of that RDD are Strings (the lines of the file), and Spark stores the RDD across the cluster in partitions. As a Spark user, you only need to care about how to deal with the lines of the file, just as if you were writing a normal program that reads a file from the local file system line by line. That's the power of Spark :)
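    A minimal sketch of that read-and-transform pattern (the local master, app name, and the in-memory `parallelize` stand-in for `sc.textFile("hdfs://...")` are assumptions for illustration, not part of the original answer):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    // Assumed setup: local mode so the sketch is self-contained; a real job
    // would be submitted to a cluster and read from HDFS with sc.textFile.
    val sc = new SparkContext(new SparkConf().setAppName("lines-sketch").setMaster("local[2]"))

    // parallelize stands in for sc.textFile("hdfs://..."); each String is one element.
    val lines = sc.parallelize(Seq("first line", "second line"))

    // You work element by element, as if reading a local file line by line;
    // Spark decides how the elements are split into partitions.
    val lengths = lines.map(_.length)
    val result = lengths.collect().sorted
    sc.stop()
    ```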

    In any case, you have no idea which elements will be stored in a given partition, so it doesn't make sense to try to update a particular partition directly.
