Implementing Reservoir Sampling Using MapReduce
Question

This post, "http://had00b.blogspot.com/2013/07/random-subset-in-mapreduce.html", describes how to implement reservoir sampling in the MapReduce framework. I feel its solution is complicated and that the following simpler approach would work.

Problem: Given a very large number of samples, generate a set of size k such that each sample has an equal probability of being present in the set.

Proposed solution:

Map operation: For each input number n, output (i, n) where i is randomly chosen in range 0