I'm trying to implement the KMeans algorithm using PySpark. It fails with TypeError: object of type 'map' has no len() on the last line of the while loop. The code works fine outside the loop, but after I wrapped it in the loop it started raising this error. How do I fix this?
# Find K Means of Loudacre device status locations
#
# Input data: file(s) with device status data (delimited by '|')
# including latitude (13th field) and longitude (14th field) of device locations
# (lat,lon of 0,0 indicates unknown location)
# NOTE: Copy to pyspark using %paste
# for a point p and an array of points, return the index in the array of the point closest to p
def closestPoint(p, points):
    bestIndex = 0
    closest = float("+inf")
    # for each point in the array, calculate the distance to the test point, then return
    # the index of the array point with the smallest distance
    for i in range(len(points)):
        dist = distanceSquared(p,points[i])
        if dist < closest:
            closest = dist
            bestIndex = i
    return bestIndex
# The squared distances between two points
def distanceSquared(p1,p2):
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2
# The sum of two points
def addPoints(p1,p2):
    return [p1[0] + p2[0], p1[1] + p2[1]]
# The files with device status data
filename = "/loudacre/devicestatus_etl/*"
# K is the number of means (center points of clusters) to find
K = 5
# ConvergeDist -- the threshold "distance" between iterations at which we decide we are done
convergeDist=.1
# Parse device status records into [latitude,longitude]
rdd2=rdd1.map(lambda line:(float((line.split(",")[3])),float((line.split(",")[4]))))
# Filter out records where lat/long is unavailable -- ie: 0/0 points
# TODO
filterd=rdd2.filter(lambda x:x!=(0,0))
# start with K randomly selected points from the dataset
# TODO
sample=filterd.takeSample(False,K,42)
# loop until the total distance between one iteration's points and the next is less than the convergence distance specified
tempDist =float("+inf")
while tempDist > convergeDist:
    # for each point, find the index of the closest kpoint. map to (index, (point,1))
    # TODO
    indexed =filterd.map(lambda (x1,x2):(closestPoint((x1,x2),sample),((x1,x2),1)))
    # For each key (k-point index), reduce by adding the coordinates and number of points
    reduced=indexed.reduceByKey(lambda x,y: ((x[0][0]+y[0][0],x[0][1]+y[0][1]),x[1]+y[1]))
    # For each key (k-point index), find a new point by calculating the average of each closest point
    # TODO
    newCenters=reduced.mapValues(lambda x1: [x1[0][0]/x1[1], x1[0][1]/x1[1]]).sortByKey()
    # calculate the total of the distance between the current points and new points
    newSample=newCenters.collect() #new centers as a list
    samples=zip(newSample,sample) #sample=> old centers
    samples1=sc.parallelize(samples)
    totalDistance=samples1.map(lambda x:distanceSquared(x[0][1],x[1]))
    # Copy the new points to the kPoints array for the next iteration
    tempDist=totalDistance.sum()
    sample=map(lambda x:x[1],samples) #new sample for next iteration as list
sample
You are getting this error because you are calling len on a map object. In Python 3, map returns a lazy iterator rather than a list, and iterators do not support len. For example:
>>> x = [[1, 'a'], [2, 'b'], [3, 'c']]
# `map` returns an object of type `map`
>>> map(lambda a: a[0], x)
<map object at 0x101b75ba8>
# calling `len` on it raises an error
>>> len(map(lambda a: a[0], x))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'map' has no len()
To find the length, you have to convert the map object to a list (or tuple) first, and then call len on the result. For example:
>>> len(list(map(lambda a: a[0], x)))
3
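One caveat with this conversion: a map object can only be consumed once, so it is usually worth keeping the converted list rather than the map itself. A minimal sketch, reusing the x list from the example above (the names m, first and second are only illustrative):

```python
x = [[1, 'a'], [2, 'b'], [3, 'c']]

# `m` is a lazy iterator over the first element of each pair
m = map(lambda a: a[0], x)

# The first conversion walks the iterator and collects every item
first = list(m)

# The iterator is now exhausted, so a second conversion yields nothing
second = list(m)

print(first)   # [1, 2, 3]
print(second)  # []
```

Storing first and calling len(first) avoids both the TypeError and the surprise of an empty second pass.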
Or, even better, simply create a list with a list comprehension (without using map):
>>> my_list = [a[0] for a in x]
# since it is a `list`, you can take its length
>>> len(my_list)
3
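Applied to the loop in the question, this likely explains why the code works outside the loop. On the first pass sample is the list returned by takeSample, but the last line of the loop rebinds sample to a map object, so the next call to closestPoint fails at len(points). The sketch below reproduces that without Spark. The center values are made up, new_centers is a hypothetical stand-in for the (index, [lat, lon]) pairs from newCenters.collect(), and taking the next sample from those pairs is an assumption about what the loop intends:

```python
def distanceSquared(p1, p2):
    # Squared Euclidean distance between two 2-D points
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2

def closestPoint(p, points):
    # Index of the entry in `points` nearest to p; needs len() and indexing
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(points)):
        dist = distanceSquared(p, points[i])
        if dist < closest:
            closest = dist
            bestIndex = i
    return bestIndex

# Hypothetical stand-in for newCenters.collect(): (index, [lat, lon]) pairs
new_centers = [(0, [1.0, 1.0]), (1, [9.0, 9.0])]

# In Python 3, map() is lazy, so closestPoint cannot take its length
lazy_sample = map(lambda x: x[1], new_centers)
try:
    closestPoint((2.0, 2.0), lazy_sample)
    error = None
except TypeError as exc:
    error = str(exc)

# A list comprehension produces a real list that supports len() and indexing
sample = [center for _, center in new_centers]
index = closestPoint((2.0, 2.0), sample)

print(error)
print(index)
```

In the question's loop, the equivalent change would be to build sample with a list comprehension (or wrap the map call in list) on the last line, so that every iteration passes a list to closestPoint.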
Source: https://stackoverflow.com/questions/41903852/typeerror-object-of-type-map-has-no-len-python3