Select random row from a PostgreSQL table with weighted row probabilities

问题

Example input:

SELECT * FROM test;
 id | percent   
----+----------
  1 | 50 
  2 | 35   
  3 | 15   
(3 rows)

How would you write such query, that on average 50% of time i could get the row with id=1, 35% of time row with id=2, and 15% of time row with id=3?

I tried something like SELECT id FROM test ORDER BY p * random() DESC LIMIT 1, but it gives wrong results. After 10,000 runs I get a distribution like: {1=6293, 2=3302, 3=405}, but I expected the distribution to be nearly: {1=5000, 2=3500, 3=1500}.

Any ideas?

回答1:

This should do the trick:

WITH CTE AS (
    SELECT random() * (SELECT SUM(percent) FROM YOUR_TABLE) R
)
SELECT *
FROM (
    SELECT id, SUM(percent) OVER (ORDER BY id) S, R
    FROM YOUR_TABLE CROSS JOIN CTE
) Q
WHERE S >= R
ORDER BY id
LIMIT 1;

The sub-query Q gives the following result:

1  50
2  85
3  100

We then simply generate a random number in range [0, 100) and pick the first row that is at or beyond that number (the WHERE clause). We use common table expression (WITH) to ensure the random number is calculated only once.

BTW, the SELECT SUM(percent) FROM YOUR_TABLE allows you to have any weights in percent - they don't strictly need to be percentages (i.e. add-up to 100).

[SQL Fiddle]

回答2:

ORDER BY random() ^ (1.0 / p)

from the algorithm described by Efraimidis and Spirakis.

回答3:

Your proposed query appears to work; see this SQLFiddle demo. It creates the wrong distribution though; see below.

To prevent PostgreSQL from optimising the subquery I've wrapped it in a VOLATILE SQL function. PostgreSQL has no way to know that you intend the subquery to run once for every row of the outer query, so if you don't force it to volatile it'll just execute it once. Another possibility - though one that the query planner might optimize out in future - is to make it appear to be a correlated subquery, like this hack that uses an always-true where clause, like this: http://sqlfiddle.com/#!12/3039b/9

At a guess (before you updated to explain why it didn't work) your testing methodology was at fault, or you're using this as a subquery in an outer query where PostgreSQL is noticing it isn't a correlated subquery and executing it just once, like in this example. .

UPDATE: The distribution produced isn't what you're expecting. The issue here is that you're skewing the distribution by taking multiple samples of random(); you need a single sample.

This query produces the correct distribution (SQLFiddle):

WITH random_weight(rw) AS (SELECT random() * (SELECT sum(percent) FROM test))
 SELECT id
FROM (                   
  SELECT 
    id,
    sum(percent) OVER (ORDER BY id),
    coalesce(sum(prev_percent) OVER (ORDER BY id),0) FROM (
      SELECT 
        id,
        percent,
        lag(percent) OVER () AS prev_percent
      FROM test
    ) x
) weighted_ids(id, weight_upper, weight_lower)
CROSS JOIN random_weight
WHERE rw BETWEEN weight_lower AND weight_upper;

Performance is, needless to say, horrible. It's using two nested sets of windows. What I'm doing is:

Creating (id, percent, previous_percent) then using that to create two running sums of weights that are used as range brackets; then
Taking a random value, scaling it to the range of weights, and then picking a value that has weights within the target bracket

回答4:

Here is something for you to play with:

select t1.id as id1
  , case when t2.id is null then 0 else t2.id end as id2
  , t1.percent as percent1
  , case when t2.percent is null then 0 else t2.percent end as percent2 
from "Test1" t1 
  left outer join "Test1" t2 on t1.id = t2.id + 1
where random() * 100 between t1.percent and 
  case when t2.percent is null then 0 else t2.percent end;

Essentially perform a left outer join so that you have two columns to apply a between clause.

Note that it will only work if you get your table ordered in the right way.

回答5:

Based on Branko Dimitrijevic's answer, I wrote this query, which may or may not be faster by using the sum total of percent using tiered windowing functions (not unlike a ROLLUP).

WITH random AS (SELECT random() AS random)
SELECT id FROM (
    SELECT id, percent,
    SUM(percent) OVER (ORDER BY id) AS rank,
    SUM(percent) OVER () * random AS roll
    FROM test CROSS JOIN random
) t WHERE roll <= rank LIMIT 1

If the ordering isn't important, SUM(percent) OVER (ROWS UNBOUNDED PRECEDING) AS rank, may be preferable because it avoids having to sort the data first.

I also tried Mechanic Wei's answer (as described in this paper, apparently), which seems very promising in terms of performance, but after some testing, the distribution appear to be off :

SELECT id
FROM test
ORDER BY random() ^ (1.0/percent)
LIMIT 1

来源：https://stackoverflow.com/questions/13040246/select-random-row-from-a-postgresql-table-with-weighted-row-probabilities

标签

sql

postgresql

statistics

probability