As Spark's MLlib doesn't have nearest-neighbors functionality, I'm trying to use Annoy for approximate nearest neighbors. I try to broadcast the Annoy object and pass it
In case anyone else is following along here like I was: you'll need to import Annoy inside the function you pass to mapPartitions, otherwise you'll still get pickling errors. Here's my completed example based on the above:
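from annoy import AnnoyIndex
from pyspark import SparkFiles
from pyspark import SparkContext
from pyspark import SparkConf
import random

random.seed(42)
f = 1024  # vector dimensionality

# Build a small index of 100 random vectors on the driver and save it to disk.
# "angular" was Annoy's implicit default metric; recent releases ask for it explicitly.
t = AnnoyIndex(f, "angular")
allvectors = []
for i in range(100):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)
    allvectors.append((i, v))
t.build(10)  # 10 trees
t.save("index.ann")

def find_neighbors(i):
    # Import and load inside the function so the index is created on the worker,
    # once per partition, instead of being pickled on the driver.
    from annoy import AnnoyIndex
    ai = AnnoyIndex(f, "angular")
    ai.load(SparkFiles.get("index.ann"))
    return (ai.get_nns_by_vector(vector=x[1], n=5) for x in i)

with SparkContext(conf=SparkConf().setAppName("myannoy")) as sc:
    sc.addFile("index.ann")  # ship the saved index file to every worker
    sparkvectors = sc.parallelize(allvectors)
    print(sparkvectors.mapPartitions(find_neighbors).first())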
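
If it helps to see why the index is loaded inside the function rather than broadcast, here is a rough standard-library sketch of the idea. Everything in it is a stand-in I made up for illustration: `NativeBackedIndex` is not Annoy's class (it just holds a `threading.Lock` to mimic a handle that can't be pickled), and `lookup_partition` only imitates the shape of a mapPartitions function. The assumption is that Annoy's index fails to pickle for a broadly similar reason, so you ship the file path and rebuild the object on the worker.

```python
import pickle
import threading


class NativeBackedIndex:
    """Hypothetical stand-in for an index that wraps native state (not Annoy's API)."""

    def __init__(self, path):
        self.path = path
        # A lock cannot be pickled, much like a handle into native memory.
        self._handle = threading.Lock()


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False if it raises."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


def lookup_partition(rows, path="index.ann"):
    """Imitate a mapPartitions function: build the index once, then stream results."""
    index = NativeBackedIndex(path)  # created where the rows live, never pickled
    return ((row, index.path) for row in rows)


driver_side_index = NativeBackedIndex("index.ann")
object_ok = can_pickle(driver_side_index)     # the object itself cannot be shipped
path_ok = can_pickle(driver_side_index.path)  # a plain string path can
results = list(lookup_partition([0, 1, 2]))

print(object_ok, path_ok, results)
```

The same split shows up in the Spark example: `sc.addFile` moves the file, `SparkFiles.get` resolves its local path on the worker, and the index object only ever exists inside `find_neighbors`.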