How to keep a random subset of a stream of data?


Question


I have a stream of events flowing through my servers. It is not feasible for me to store all of them, but I would like to periodically be able to process some of them in aggregate. So, I want to keep a subset of the stream that is a random sampling of everything I've seen, but is capped to a max size.

So, for each new item, I need an algorithm to decide if I should add it to the stored set, or if I should discard it. If I add it, and I'm already at my limit, I need an algorithm to evict one of the old items.

Obviously, this is easy as long as I'm below my limit (just save everything). But how can I maintain a good random sampling without being biased towards old items or new items once I'm past that limit?

Thanks,


Answer 1:


This is a common interview question.

One easy way to do it is to keep the nth element with probability min(1, k/n), where k is your cap on the sample size. If the sample is already full when you keep a new element, evict a randomly chosen old element to make room.

This gives you a uniformly random subset of the n elements. If you don't know n, you can estimate it and get an approximately uniform subset.
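As a concrete illustration (not part of the original answer), here is a minimal Python sketch of that rule; the stream can be any iterable:

import random

def reservoir_sample(stream, k):
    sample = []
    for n, item in enumerate(stream, start=1):
        if len(sample) < k:
            sample.append(item)                  # k/n >= 1 here, so always keep
        elif random.random() < k / n:            # keep the nth item with probability k/n
            sample[random.randrange(k)] = item   # evict a uniformly chosen old item
    return sample

For example, sample = reservoir_sample(events, 1000) keeps a uniform random subset of at most 1000 events, however long the stream runs.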




Answer 2:


This is called reservoir sampling. Source: http://en.wikipedia.org/wiki/Reservoir_sampling

array R[k];    // result
integer i, j;

// fill the reservoir array
for each i in 1 to k do
    R[i] := S[i]
done;

// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
    j := random(1, i);   // important: inclusive range
    if j <= k then
        R[j] := S[i]
    fi
done

A decent explanation/proof: http://propersubset.com/2010/04/choosing-random-elements.html




Answer 3:


While this paper isn't precisely what you're looking for, it may be a good starting point in your search.




Answer 4:


Store samples in a first-in, first-out (FIFO) queue.

Set a sampling rate of so many events between samples, or randomize the rate a bit, depending on your patterns of events.

Save every nth event, or whenever your rate tells you to, and stick it on the end of the queue.

Pop one off the front if the size gets too big.
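A minimal Python sketch of this approach (the rate constant is an assumption you would tune to your event volume); note that it keeps a rolling window of recent samples rather than a uniform sample over the whole stream:

from collections import deque

MAX_SIZE = 1000       # cap on stored samples
SAMPLE_EVERY = 50     # assumed rate: keep one event out of every 50

queue = deque()
seen = 0

def on_event(event):
    global seen
    seen += 1
    if seen % SAMPLE_EVERY == 0:   # save every nth event
        queue.append(event)        # stick it on the end of the queue
        if len(queue) > MAX_SIZE:
            queue.popleft()        # pop the oldest off the front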




Answer 5:


This assumes you don't know the total number of events that will be received and that you don't need a minimum number of elements in the subset.

arr = arr[MAX_SIZE] //Create a new array that will store the events. Assuming first index 1.
counter = 1 //Initialize a counter.

while(receiving event){
  if( counter <= MAX_SIZE ){
     arr[counter] = event //Below the cap: always store the event.
  }
  else{
     random = //Generate a random number between 1 and counter (inclusive)
     if( random <= MAX_SIZE ){
        arr[random] = event //Keep the event with probability MAX_SIZE/counter, overwriting a uniformly chosen slot.
     }
  }
  counter += 1
}



Answer 6:


Assign each event a probability of being recorded and store accepted events in an indexable data structure. When the size of the structure reaches the threshold, remove a random element before adding a new one. In Ruby, you could do this:

@storage = []
prob = 0.002

while ( message = getnextMessage ) do
    # evict a uniformly chosen element once the cap is exceeded (delete_at removes by index; delete removes by value)
    @storage.delete_at(rand(@storage.length)) if @storage.length > MAX_LEN
    @storage << message if rand() < prob
end

This addresses both your max size and your requirement not to be biased toward when the event occurred. You could also choose which element gets deleted by partitioning your stored elements into buckets and then removing an element from any bucket that has more than one element. The bucket method allows you to keep one from each hour, for example (a rough sketch follows).
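A rough Python sketch of the bucket variant, assuming hour-keyed buckets (the names and the hour keying are illustrative, not from the original answer):

import random
from collections import defaultdict

MAX_LEN = 1000
buckets = defaultdict(list)   # e.g. hour -> events stored for that hour

def add_event(hour, event):
    buckets[hour].append(event)
    if sum(len(v) for v in buckets.values()) > MAX_LEN:
        # evict only from buckets holding more than one element,
        # so every hour keeps at least one representative
        crowded = [h for h, v in buckets.items() if len(v) > 1]
        if crowded:
            victims = buckets[random.choice(crowded)]
            victims.pop(random.randrange(len(victims)))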

You should also know that sampling theory is Big Math. If you need more than a layman's idea about this, consult a qualified mathematician in your area.



Source: https://stackoverflow.com/questions/4340380/how-to-keep-a-random-subset-of-a-stream-of-data
