Is it possible to create nested RDDs in Apache Spark?

Submitted by 别来无恙 on 2019-11-28 00:28:01

No, it is not possible, because the items of an RDD must be serializable and an RDD is not serializable. This also makes sense: otherwise you might transfer a whole RDD over the network, which is a problem if it contains a lot of data. And if it does not contain a lot of data, you can and should use an array or a similar local collection.
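
As a concrete illustration of the "use an array" suggestion, here is a minimal sketch that collects the small RDD to a local array and ships it to the executors as a broadcast variable. The point data, the object name, and the use of a broadcast variable are illustrative assumptions, not part of the original answer:

    import org.apache.spark.sql.SparkSession

    object NestedRddAlternative {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("NestedRddAlternative")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        val points  = sc.parallelize(Seq((1.0, 2.0), (3.0, 4.0)))
        val centers = sc.parallelize(Seq((0.0, 0.0), (10.0, 10.0)))

        // Instead of referencing `centers` (an RDD) inside a transformation
        // on `points`, collect the small side into a plain array and ship
        // that array to the executors as a broadcast variable.
        val localCenters = sc.broadcast(centers.collect())

        // For each point, find the closest center using only the local array.
        val nearest = points.map { case (x, y) =>
          localCenters.value.minBy { case (cx, cy) => math.hypot(x - cx, y - cy) }
        }
        nearest.collect().foreach(println)

        spark.stop()
      }
    }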

However, I don't know how you are implementing K-nearest neighbors. Be careful: if you do something like computing the distance between every pair of points, that approach does not scale with the dataset size, because it is O(n²).
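
To make the scalability concern concrete, here is a hedged sketch of the naive all-pairs step (assuming an existing SparkContext sc; the IDs and coordinates are made up). cartesian materializes all n × n pairs, so the shuffle and computation both grow quadratically with the number of points:

    // Naive all-pairs distance computation: cartesian produces n * n pairs.
    val pts = sc.parallelize(Seq(
      (1L, (0.0, 0.0)),
      (2L, (3.0, 4.0)),
      (3L, (6.0, 8.0))
    ))

    val pairwise = pts.cartesian(pts)
      .filter { case ((idA, _), (idB, _)) => idA != idB }   // drop self-pairs
      .map { case ((idA, (ax, ay)), (idB, (bx, by))) =>
        ((idA, idB), math.hypot(ax - bx, ay - by))          // Euclidean distance
      }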

I ran into a NullPointerException while trying something of this sort, because we can't perform operations on RDDs within an RDD.
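
A minimal reproduction of that failure (hypothetical data, assuming an existing SparkContext sc): the inner action needs the SparkContext, which does not exist inside an executor task, so the job dies at runtime with a NullPointerException (recent Spark versions raise a SparkException with a clearer message instead):

    val outer = sc.parallelize(1 to 3)
    val inner = sc.parallelize(10 to 12)

    // Illegal: `inner.count()` is invoked inside a task running on an
    // executor, where no SparkContext is available, so this fails at runtime.
    val broken = outer.map(x => x * inner.count())
    broken.collect()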

Spark doesn't support nesting of RDDs. The reason is that performing an operation on an RDD, or creating a new one, requires access to the SparkContext object, which is available only on the driver machine.

Hence, if you want to operate on nested RDDs, you can collect the parent RDD on the driver node and then iterate over its items as a local array, as shown in the sketch below.
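
A minimal sketch of that collect-then-iterate pattern, assuming an existing SparkContext sc (the key/value data and variable names are illustrative):

    val groups = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // The list of distinct keys is small, so collecting it to the driver is safe.
    val keys = groups.keys.distinct().collect()

    for (key <- keys) {
      // Each iteration runs on the driver, so launching a new RDD job is legal.
      val total = groups.filter { case (k, _) => k == key }.values.sum()
      println(s"$key -> $total")
    }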

Note: the RDD class itself is declared serializable in the Spark sources (abstract class RDD[T] extends Serializable), so the restriction comes from the SparkContext being unavailable on executors rather than from serialization.
