Given an array of n word-frequency pairs:

[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]
Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:
- there are n partitions, all of the same width r, s.t. nr = m
- for each word wi: fi = ∑ (over partitions t s.t. wi ∈ t) r × ratio(t, wi)
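As a tiny concrete illustration (the words and numbers here are made up): take two words, "a" with frequency 1 and "b" with frequency 5, so m = 6; with n = 2 partitions of width r = 3, one valid partition is {a: 1, b: 2} and {b: 3}, and summing r × ratio over the partitions recovers each original frequency:

```ruby
# Illustrative data: word "a" has frequency 1, word "b" has frequency 5,
# so m = 6, n = 2, and each partition has width r = m / n = 3.
r = 3.0

# Each partition maps a word to its share (ratio) of the partition's width.
partitions = [
  { "a" => 1.0 / 3, "b" => 2.0 / 3 }, # "a" takes 1 of the 3 units, "b" the rest
  { "b" => 1.0 }                      # "b" fills the second partition entirely
]

# fi = sum over partitions t containing wi of r * ratio(t, wi)
freq = Hash.new(0.0)
partitions.each do |t|
  t.each { |word, ratio| freq[word] += r * ratio }
end
# freq["a"] ~= 1.0 and freq["b"] ~= 5.0, the original frequencies
```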
Since all the partitions are of the same size, selecting a partition can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which word is used, also in constant work (compare a pRNGed number with the ratio between the two words). So this means the p selections can be done in O(p) work, given such a partition.
The reason that such a partitioning exists is that there exists a word wi s.t. fi < r if and only if there exists a word wi' s.t. fi' > r, since r is the average of the frequencies.
Given such a pair wi and wi', we can replace them with a pseudo-word w'i of frequency f'i = r (that represents wi with probability fi/r and wi' with probability 1 - fi/r) and a new word w'i' of adjusted frequency f'i' = fi' - (r - fi) respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words, which are the desired partition.
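One replacement step, with made-up numbers (r = 3, a word "a" with frequency 1 below the average, a word "b" with frequency 5 above it), looks like this:

```ruby
r = 3

# a word below the average and a word above it (illustrative values)
f_a = 1
f_b = 5

# pseudo-word of frequency r: represents "a" with probability f_a / r
# and "b" with probability 1 - f_a / r
pseudo = { width: r, first: "a", second: "b", first_ratio: f_a.to_f / r }

# "b" keeps the frequency it did not donate: f'b = f_b - (r - f_a)
f_b_adjusted = f_b - (r - f_a)   # 5 - (3 - 1) = 3

# the total frequency is unchanged, so the average is still r
total_before = f_a + f_b
total_after  = pseudo[:width] + f_b_adjusted
```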
To construct this partition in O(n) time, split the words into those with frequency at most the average and those above it, then repeatedly pair one of each as described above, returning any leftover frequency to the appropriate list (this is what the code below does).
This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can pad all the frequencies by a factor of n, so f'i = nfi, which updates m' = mn and sets r' = m when q = n.
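For instance, with the illustrative frequencies 1, 2, 4 we get m = 7 and n = 3, so r = 7/3 is not integral; padding by n fixes that:

```ruby
freqs = [1, 2, 4]
n = freqs.size        # 3
m = freqs.sum         # 7 -- r = m / n = 7/3 is not an integer

# pad every frequency by a factor of n
padded = freqs.map { |f| f * n }  # [3, 6, 12]
m_padded = padded.sum             # m' = mn = 21
r_padded = m_padded / n           # r' = m = 7, now integral
```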
In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.
In Ruby:
def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum, (word, freq)| sum + freq }
  # find the words with frequency lesser and greater than average
  lessers, greaters = input.map do |word, freq|
    # pad the frequency so we can keep it integral
    # when subdivided
    [ word, freq * n ]
  end.partition do |word, adj_freq|
    adj_freq <= m
  end
  partitions = Array.new(n) do
    word, adj_freq = lessers.shift
    other_word = if adj_freq < m
      # use part of another word's frequency to pad
      # out the partition
      other_word, other_adj_freq = greaters.shift
      other_adj_freq -= (m - adj_freq)
      (other_adj_freq <= m ? lessers : greaters) << [ other_word, other_adj_freq ]
      other_word
    end
    [ word, other_word, adj_freq ]
  end
  (0...p).map do
    # pick a partition at random
    word, other_word, adj_freq = partitions[ rand(n) ]
    # select the first word in the partition with appropriate
    # probability
    if rand(m) < adj_freq
      word
    else
      other_word
    end
  end
end
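As a sanity check, drawing a large sample should reproduce the input frequencies roughly in proportion. The snippet below repeats the method so it runs on its own; the input words and frequencies are made up:

```ruby
def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum, (word, freq)| sum + freq }
  # words at or below the padded average versus above it
  lessers, greaters = input.map { |word, freq| [word, freq * n] }
                           .partition { |_word, adj_freq| adj_freq <= m }
  partitions = Array.new(n) do
    word, adj_freq = lessers.shift
    other_word = if adj_freq < m
      # borrow frequency from a word above the average
      other_word, other_adj_freq = greaters.shift
      other_adj_freq -= (m - adj_freq)
      (other_adj_freq <= m ? lessers : greaters) << [other_word, other_adj_freq]
      other_word
    end
    [word, other_word, adj_freq]
  end
  (0...p).map do
    word, other_word, adj_freq = partitions[rand(n)]
    rand(m) < adj_freq ? word : other_word
  end
end

input  = [["a", 1], ["b", 2], ["c", 3]]  # m = 6
sample = weighted_sample_with_replacement(input, 60_000)
counts = sample.tally
# counts should be roughly proportional to 1 : 2 : 3
```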