Efficient algorithm to randomly select items with frequency

孤独总比滥情好 2021-01-02 06:52

Given an array of n word-frequency pairs:

[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]


        
3 Answers
  •  感情败类
    2021-01-02 07:29

    Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:

    • There are n partitions, all of the same width r s.t. nr = m, where m is the sum of all the frequencies.
    • Each partition contains two words in some ratio (which is stored with the partition).
    • For each word wi, fi = Σ r × ratio(t, wi), summed over every partition t such that wi ∈ t.

    Since all the partitions are the same size, selecting a partition takes constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select one of its two words, also in constant work (compare a pseudo-random number with the ratio between the two words). So, given such a partition, the p selections can be done in O(p) work.
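
    As a minimal sketch of that constant-time selection, assume a hand-built table for the hypothetical words a, b, c with frequencies 1, 2, 3 (so m = 6, n = 3, r = 2). The [word, other_word, share] layout and the names pick and draw are illustrative choices, not something the method prescribes:

```ruby
# Hypothetical partition table for a:1, b:2, c:3 (m = 6, n = 3, r = 2).
# Each entry is [word, other_word, share]: `share` units out of r belong
# to `word`, and the remaining r - share units belong to `other_word`.
r = 2
partitions = [
  [:a, :c, 1],   # a owns 1 of 2 units, c owns the other unit
  [:b, nil, 2],  # b fills its partition alone
  [:c, nil, 2]   # c fills its partition alone
]

# Deterministic core: which word owns unit u of partition `index`?
def pick(partitions, index, u)
  word, other_word, share = partitions[index]
  u < share ? word : other_word
end

# One O(1) draw: a uniform partition index, then a uniform unit in 0...r.
def draw(partitions, r, rng = Random.new)
  pick(partitions, rng.rand(partitions.size), rng.rand(r))
end

puts draw(partitions, r)
```

    Enumerating all n × r equally likely (index, unit) pairs with pick gives each word exactly its frequency, which is why the two uniform draws reproduce the weighted distribution.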

    Such a partitioning exists because there is a word wi s.t. fi < r if and only if there is a word wi' s.t. fi' > r, since r is the average of the frequencies.

    Given such a pair wi and wi', we can replace wi with a pseudo-word w'i of frequency f'i = r (which represents wi with probability fi/r and wi' with probability 1 - fi/r), and replace wi' with a new word w'i' of adjusted frequency f'i' = fi' - (r - fi). The average frequency of all the words is still r, so the rule from the prior paragraph still applies. Since a pseudo-word has frequency exactly r and is only ever made of two words with frequency ≠ r, iterating this process never makes a pseudo-word out of a pseudo-word, and the iteration must end with a sequence of n pseudo-words, which is the desired partition.
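
    A single pairing step can be sketched as follows. The function name pair_step and the hash layout are hypothetical, and Rational is used only to keep the probability exact:

```ruby
# One pairing step: combine a word below the average r with a word above it.
# small = [wi, fi] with fi < r, large = [wi', fi'] with fi' > r.
def pair_step(small, large, r)
  (w, f), (w2, f2) = small, large
  # The pseudo-word has frequency exactly r and stands for w with
  # probability f / r, and for w2 with probability 1 - f / r.
  pseudo = { words: [w, w2], frequency: r, p_first: Rational(f, r) }
  # w2 gives up r - f of its frequency to fill out the pseudo-word.
  leftover = [w2, f2 - (r - f)]
  [pseudo, leftover]
end

pseudo, leftover = pair_step([:a, 1], [:c, 3], 2)
puts pseudo.inspect
puts leftover.inspect
```

    With a:1 and c:3 at r = 2, the pseudo-word represents a half of the time and c the other half, and c is left with frequency 3 - (2 - 1) = 2. The total frequency (4) is unchanged, so the average over the remaining words stays at r.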

    To construct this partition in O(n) time,

    • go through the list of the words once, constructing two lists:
      • one of words with frequency ≤ r
      • one of words with frequency > r
    • then pull a word from the first list
      • if its frequency = r, then make it into a one element partition
      • otherwise, pull a word from the other list, and use it to fill out a two-word partition. Then put the second word back into either the first or second list according to its adjusted frequency.

    This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can scale all the frequencies by a factor of n, so f'i = n × fi, which updates m' = mn and sets r' = m when q = n.
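
    To see why the scaling helps, here is a small arithmetic check with made-up frequencies 1, 2, 4 (the variable names are only for illustration):

```ruby
freqs = [1, 2, 4]          # hypothetical frequencies
n = freqs.size             # 3 words
m = freqs.sum              # total frequency: 7

r = Rational(m, n)         # unscaled partition width 7/3, not an integer
scaled = freqs.map { |f| f * n }
r_scaled = scaled.sum / n  # scaled width: (m * n) / n = m, always integral

puts "r = #{r}, scaled = #{scaled.inspect}, r' = #{r_scaled}"
```

    Relative weights are unchanged by the scaling, so the sampling distribution is the same; only the bookkeeping moves to integers.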

    In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.

    In Ruby:

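    # input: array of [word, frequency] pairs with integer frequencies
    # p:     number of words to draw (with replacement)
    def weighted_sample_with_replacement(input, p)
      n = input.size
      # m is the total frequency, i.e. the size of the probability space
      m = input.inject(0) { |sum,(word,freq)| sum + freq }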
    
      # find the words with frequency lesser and greater than average
      lessers, greaters = input.map do |word,freq| 
                            # pad the frequency so we can keep it integral
                            # when subdivided
                            [ word, freq*n ] 
                          end.partition do |word,adj_freq| 
                            adj_freq <= m 
                          end
    
      partitions = Array.new(n) do
        word, adj_freq = lessers.shift
    
        other_word = if adj_freq < m
                       # use part of another word's frequency to pad
                       # out the partition
                       other_word, other_adj_freq = greaters.shift
                       other_adj_freq -= (m - adj_freq)
                       (other_adj_freq <= m ? lessers : greaters) << [ other_word, other_adj_freq ]
                       other_word
                     end
    
        [ word, other_word , adj_freq ]
      end
    
      (0...p).map do 
        # pick a partition at random
        word, other_word, adj_freq = partitions[ rand(n) ]
        # select the first word in the partition with appropriate
        # probability
        if rand(m) < adj_freq
          word
        else
          other_word
        end
      end
    end
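
    One way to sanity-check the construction without relying on random draws is to rebuild the table and add up the share each word owns. The sketch below repeats the same construction in a separate, hypothetical helper (build_partitions) and sums the shares with another (mass_by_word); the input frequencies are made up:

```ruby
# Build the partition table as [word, other_word, share] triples.
# Frequencies are scaled by n so every partition has integer width m.
def build_partitions(input)
  n = input.size
  m = input.sum { |_, freq| freq }
  lessers, greaters = input.map { |word, freq| [word, freq * n] }
                           .partition { |_, adj| adj <= m }
  Array.new(n) do
    word, adj = lessers.shift
    other_word = nil
    if adj < m
      # Borrow m - adj from a word that is above the average.
      other_word, other_adj = greaters.shift
      other_adj -= m - adj
      (other_adj <= m ? lessers : greaters) << [other_word, other_adj]
    end
    [word, other_word, adj]
  end
end

# Add up how much of the scaled probability space each word owns.
def mass_by_word(partitions, m)
  partitions.each_with_object(Hash.new(0)) do |(word, other_word, share), mass|
    mass[word] += share
    mass[other_word] += m - share if other_word
  end
end

input = [[:a, 1], [:b, 2], [:c, 4]]  # hypothetical word-frequency pairs
n = input.size
m = input.sum { |_, freq| freq }
mass = mass_by_word(build_partitions(input), m)
puts mass.inspect
```

    Each word should end up owning exactly n times its original frequency, which is the third partition property stated earlier, expressed in the scaled units.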
    
