Given an array of n word-frequency pairs:

[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]
Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:
- there are n partitions, all of the same width r, s.t. nr = m
- for each word wi: fi = ∑ (over partitions t s.t. wi ∈ t) r × ratio(t, wi)
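As a tiny concrete illustration (the words and numbers here are made up): take two words, "a" with frequency 1 and "b" with frequency 5, so m = 6; with n = 2 partitions of width r = 3, one valid partition is {a: 1, b: 2} and {b: 3}, and summing r × ratio over the partitions recovers each original frequency:

```ruby
# Illustrative data: word "a" has frequency 1, word "b" has frequency 5,
# so m = 6, n = 2, and each partition has width r = m / n = 3.
r = 3.0

# Each partition maps a word to its share (ratio) of the partition's width.
partitions = [
  { "a" => 1.0 / 3, "b" => 2.0 / 3 }, # "a" takes 1 of the 3 units, "b" the rest
  { "b" => 1.0 }                      # "b" fills the second partition entirely
]

# fi = sum over partitions t containing wi of r * ratio(t, wi)
freq = Hash.new(0.0)
partitions.each do |t|
  t.each { |word, ratio| freq[word] += r * ratio }
end
# freq["a"] ~= 1.0 and freq["b"] ~= 5.0, the original frequencies
```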
Since all the partitions are of the same size, selecting a partition can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which word is used, also in constant work (compare a pRNGed number with the ratio between the two words). So this means the p selections can be done in O(p) work, given such a partition.
The reason that such a partitioning exists is that there exists a word wi s.t. fi < r if and only if there exists a word wi' s.t. fi' > r, since r is the average of the frequencies.
Given such a pair wi and wi', we can replace them with a pseudo-word w'i of frequency f'i = r (that represents wi with probability fi/r and wi' with probability 1 - fi/r) and a new word w'i' of adjusted frequency f'i' = fi' - (r - fi) respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words, which are the desired partition.
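One replacement step, with made-up numbers (r = 3, a word "a" with frequency 1 below the average, a word "b" with frequency 5 above it), looks like this:

```ruby
r = 3

# a word below the average and a word above it (illustrative values)
f_a = 1
f_b = 5

# pseudo-word of frequency r: represents "a" with probability f_a / r
# and "b" with probability 1 - f_a / r
pseudo = { width: r, first: "a", second: "b", first_ratio: f_a.to_f / r }

# "b" keeps the frequency it did not donate: f'b = f_b - (r - f_a)
f_b_adjusted = f_b - (r - f_a)   # 5 - (3 - 1) = 3

# the total frequency is unchanged, so the average is still r
total_before = f_a + f_b
total_after  = pseudo[:width] + f_b_adjusted
```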
To construct this partition in O(n) time, split the words into those with frequency at most the average and those above it, then repeatedly pair one of each as described above, returning any leftover frequency to the appropriate list (this is what the code below does).
This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can pad all the frequencies by a factor of n, so f'i = nfi, which updates m' = mn and sets r' = m when q = n.
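For instance, with the illustrative frequencies 1, 2, 4 we get m = 7 and n = 3, so r = 7/3 is not integral; padding by n fixes that:

```ruby
freqs = [1, 2, 4]
n = freqs.size        # 3
m = freqs.sum         # 7 -- r = m / n = 7/3 is not an integer

# pad every frequency by a factor of n
padded = freqs.map { |f| f * n }  # [3, 6, 12]
m_padded = padded.sum             # m' = mn = 21
r_padded = m_padded / n           # r' = m = 7, now integral
```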
In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.
In Ruby:
def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum, (word, freq)| sum + freq }
  # find the words with frequency lesser and greater than average
  lessers, greaters = input.map do |word, freq|
    # pad the frequency so we can keep it integral
    # when subdivided
    [ word, freq * n ]
  end.partition do |word, adj_freq|
    adj_freq <= m
  end
  partitions = Array.new(n) do
    word, adj_freq = lessers.shift
    other_word = if adj_freq < m
      # use part of another word's frequency to pad
      # out the partition
      other_word, other_adj_freq = greaters.shift
      other_adj_freq -= (m - adj_freq)
      (other_adj_freq <= m ? lessers : greaters) << [ other_word, other_adj_freq ]
      other_word
    end
    [ word, other_word, adj_freq ]
  end
  (0...p).map do
    # pick a partition at random
    word, other_word, adj_freq = partitions[ rand(n) ]
    # select the first word in the partition with appropriate
    # probability
    if rand(m) < adj_freq
      word
    else
      other_word
    end
  end
end
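As a sanity check, drawing a large sample should reproduce the input frequencies roughly in proportion. The snippet below repeats the method so it runs on its own; the input words and frequencies are made up:

```ruby
def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum, (word, freq)| sum + freq }
  # words at or below the padded average versus above it
  lessers, greaters = input.map { |word, freq| [word, freq * n] }
                           .partition { |_word, adj_freq| adj_freq <= m }
  partitions = Array.new(n) do
    word, adj_freq = lessers.shift
    other_word = if adj_freq < m
      # borrow frequency from a word above the average
      other_word, other_adj_freq = greaters.shift
      other_adj_freq -= (m - adj_freq)
      (other_adj_freq <= m ? lessers : greaters) << [other_word, other_adj_freq]
      other_word
    end
    [word, other_word, adj_freq]
  end
  (0...p).map do
    word, other_word, adj_freq = partitions[rand(n)]
    rand(m) < adj_freq ? word : other_word
  end
end

input  = [["a", 1], ["b", 2], ["c", 3]]  # m = 6
sample = weighted_sample_with_replacement(input, 60_000)
counts = sample.tally
# counts should be roughly proportional to 1 : 2 : 3
```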