Implementing Reservoir Sampling using Map Reduce

≯℡__Kan透↙ 提交于 2019-12-13 04:29:37

问题


This link "http://had00b.blogspot.com/2013/07/random-subset-in-mapreduce.html" talks about how one can implement reservoir sampling using map reduce framework. I feel their solution is complicated and the following simpler approach would work.

Problem: Given very large number of samples, generate a set of size k such that each sample has equal probability of being present in the set.

Proposed solution:

  1. Map operation: For each input number n, output (i, n) where i is randomly chosen in range 0 to k-1.
  2. Reduce operation: Among all numbers with same key, pick one randomly.

Claim: Probability of any number being in set of k size is k/n (where n is total number of samples)

Proof intuition:

Since map operation randomly assigned each input sample to bucket number i (0 <= i <= k-1), size of each bucket would be n/k. Now each number is present only in one bucket, suppose bucket i. The probability that it gets picked in reduce operation for bucket i is 1/(n/k) = k/n

I would appreciate any second thoughts on my solution, whether it seems correct or not.


回答1:


There's a small flaw in your argument. Your algorithm may not return a sample of size k, as it may happen that none of the element gets mapped to a specific key. In the extreme case (even if it has small chance), it may happen that all the input elements get mapped to only one key, in which case your algorithm returns only one element.

The event of "missing" a specific key has probability ((k-1)/k)^n = (1-1/k)^n which is approximately (using Taylor approximation) e^{-n/k}. This is negligible if k is much smaller than n, but if k is proportional to n, say k=n/2, then this bad event actually happens with constant probability.



来源:https://stackoverflow.com/questions/18166144/implementing-reservoir-sampling-using-map-reduce

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!