Cannot access PipelinedRDD in PySpark [duplicate]

Submitted by 一笑奈何 on 2019-12-08 06:45:03

Question


I am trying to implement k-means from scratch using PySpark. I perform various operations on RDDs, but when I try to display the result of the final processed RDD I get an error like "PipelinedRDD object is not iterable", and calls such as .collect() also fail because of the same PipelinedRDD issue.

from __future__ import print_function
import sys
import numpy as np
def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex

data = SC.parallelize([1, 2, 3, 5, 7, 3, 5, 7, 3, 6, 4, 66, 33, 66, 22, 55, 77])

K = 3
convergeDist = float(0.1)

kPoints = data.takeSample(False, K, 1)
tempDist = 1.0

while tempDist > convergeDist:
    closest = data.map(
        lambda p: (closestPoint(p, kPoints), (p, 1)))

    pointStats = closest.reduceByKey(
        lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))

    newPoints = pointStats.map(
        lambda st: (st[0], st[1][0] / st[1][1]))
    print(newPoints)

    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints).collect()
    # tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)

    for (iK, p) in newPoints:
        kPoints[iK] = p

print("Final centers: " + str(kPoints))

The error I am getting is:

TypeError: 'PipelinedRDD' object is not iterable


Answer 1:


You cannot iterate over an RDD directly; you first need to call an action to bring the data back to the driver. Quick sample:

>>> test = sc.parallelize([1,2,3])
>>> for i in test:
...    print i
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'RDD' object is not iterable

This will not work because test is an RDD. On the other hand, if you bring the data back to the driver with an action, you get a local object that you can iterate over, for example:

>>> for i in test.collect():
...    print i
1                                                                               
2
3

There you go: call an action to bring the data back to the driver, being careful not to pull back too much data, or you may get an out-of-memory error on the driver.
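Applied to the question's loop, the fix is to call .collect() on newPoints once and then iterate the resulting local list, both for the distance sum and for the center update. A minimal sketch of that driver-side logic (the kPoints and collected values below are made-up stand-ins for what newPoints.collect() would return, so the arithmetic runs without a SparkContext):

```python
import numpy as np

# Current centers, and a stand-in for newPoints.collect():
# a local list of (cluster index, new center) pairs.
kPoints = [np.float64(3.0), np.float64(50.0), np.float64(70.0)]
collected = [(0, np.float64(4.0)), (1, np.float64(55.0)), (2, np.float64(66.0))]

# Iterate the collected list, not the RDD: sum of squared center movements.
tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in collected)

# Update the centers from the same local list.
for (iK, p) in collected:
    kPoints[iK] = p
```

If the result were too large to collect, gentler alternatives are rdd.take(n) to fetch only the first n elements, or rdd.toLocalIterator() to stream partitions to the driver one at a time; for k-means the collected result is only K pairs, so .collect() is safe here.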



Source: https://stackoverflow.com/questions/47733799/can-not-access-pipelined-rdd-in-pyspark
