Question
Suppose we wish to implement Locality-Sensitive Hashing (LSH) by MapReduce. Specifically, assume chunks of the signature matrix consist of columns, and elements are key-value pairs where the key is the column number and the value is the signature itself (i.e., a vector of values).
(a) Show how to produce the buckets for all the bands as output of a single MapReduce process. Hint: Remember that a Map function can produce several key-value pairs from a single element.
(b) Show how another MapReduce process can convert the output of (a) to a list of pairs that need to be compared. Specifically, for each column i, there should be a list of those columns j > i with which i needs to be compared.
Answer 1:
(a)
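- Map: takes a column number and its signature as input and emits one key-value pair per band, ((band, bucket_id), column).
- Reduce: groups the columns by key, producing the buckets for all the bands as output, i.e. ((band, bucket_id), list(columns)).
map(key, value):  // key = column number, value = signature
    split value into bands
    for band_index, band in bands:
        bucket_id = hash(band)  // hash the whole band, not each row separately
        collect((band_index, bucket_id), key)
reduce(key, values):
    collect(key, values)  // all columns that share this bucket in this band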
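The steps in (a) can be sketched in plain Python, with the shuffle simulated in memory. The names `map_signature`, `reduce_bucket`, `run_lsh_buckets` and the parameter `rows_per_band` are hypothetical, and the sketch assumes the band's values themselves serve as the bucket id; a real job would more likely hash each band into a fixed number of buckets.

```python
from collections import defaultdict


def map_signature(column, signature, rows_per_band):
    """Map: emit one ((band_index, bucket_id), column) pair per band."""
    for start in range(0, len(signature), rows_per_band):
        band = tuple(signature[start:start + rows_per_band])
        # Assumption: the band tuple itself acts as the bucket id.
        yield (start // rows_per_band, band), column


def reduce_bucket(key, columns):
    """Reduce: gather the columns that share a bucket within one band."""
    return key, sorted(columns)


def run_lsh_buckets(signatures, rows_per_band):
    """Simulate map, shuffle and reduce for a {column: signature} dict."""
    grouped = defaultdict(list)
    for column, signature in signatures.items():
        for key, value in map_signature(column, signature, rows_per_band):
            grouped[key].append(value)  # shuffle: group values by key
    return dict(reduce_bucket(key, cols) for key, cols in grouped.items())
```

Including the band index in the key keeps columns that happen to have equal values in different bands from landing in the same bucket.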
(b)
- Map: takes the output of (a) as input and, for each bucket, emits every pair of columns it contains, i.e. ((band, bucket_id), list(columns)) -> (i, j) with i < j, where combinations() yields every two columns chosen from the same bucket.
- Reduce: for each column i, outputs the list of columns j > i with which i needs to be compared.
map(key, value):  // value = list of columns in one bucket
    for a, b in combinations(value):
        collect(min(a, b), max(a, b))  // key = smaller column, value = larger column
reduce(key, values):  // key = column i
    collect(key, unique(values))  // the same pair can share buckets in several bands
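A minimal Python sketch of (b), again with the shuffle simulated in memory. The names `map_bucket`, `reduce_candidates` and `run_candidate_pairs` are hypothetical, and the input is assumed to be a dict that maps each bucket key to the list of column numbers in that bucket.

```python
from collections import defaultdict
from itertools import combinations


def map_bucket(bucket_key, columns):
    """Map: emit (i, j) with i < j for every pair of columns in a bucket."""
    # Sorting first guarantees each emitted pair has the smaller column first.
    for i, j in combinations(sorted(set(columns)), 2):
        yield i, j


def reduce_candidates(i, js):
    """Reduce: deduplicate the columns j > i that must be compared with i."""
    return i, sorted(set(js))


def run_candidate_pairs(buckets):
    """Simulate map, shuffle and reduce for a {bucket_key: columns} dict."""
    grouped = defaultdict(list)
    for bucket_key, columns in buckets.items():
        for i, j in map_bucket(bucket_key, columns):
            grouped[i].append(j)  # shuffle: group candidate j values by i
    return dict(reduce_candidates(i, js) for i, js in grouped.items())
```

Keying on the smaller column alone, rather than on the pair, is what lets the reducer see all candidates for one column i at once and drop duplicates that arise when two columns collide in more than one band.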
Source: https://stackoverflow.com/questions/29320943/how-to-implement-lsh-by-mapreduce