Example input:
SELECT * FROM test;
 id | percent
----+---------
  1 |      50
  2 |      35
  3 |      15
(3 rows)
How would you write a query such that, on average, I get the row with id=1 50% of the time, the row with id=2 35% of the time, and the row with id=3 15% of the time?
I tried something like SELECT id FROM test ORDER BY p * random() DESC LIMIT 1, but it gives wrong results. After 10,000 runs I get a distribution like {1=6293, 2=3302, 3=405}, but I expected the distribution to be nearly {1=5000, 2=3500, 3=1500}.
Any ideas?
This should do the trick:
WITH CTE AS (
SELECT random() * (SELECT SUM(percent) FROM YOUR_TABLE) R
)
SELECT *
FROM (
SELECT id, SUM(percent) OVER (ORDER BY id) S, R
FROM YOUR_TABLE CROSS JOIN CTE
) Q
WHERE S >= R
ORDER BY id
LIMIT 1;
The sub-query Q gives the following result (columns id and the running sum S):
1 50
2 85
3 100
We then simply generate a random number in the range [0, 100) and pick the first row that is at or beyond that number (the WHERE clause). We use a common table expression (WITH) to ensure the random number is calculated only once.
BTW, the SELECT SUM(percent) FROM YOUR_TABLE allows you to have any weights in percent; they don't strictly need to be percentages (i.e. add up to 100).
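The running-sum selection described above can be sketched outside the database. This is a minimal Python simulation, not part of the original answer: the `ROWS` list, the `pick` and `sample` helpers, and the seed are illustrative assumptions that mirror the `test` table and the `S >= R` filter.

```python
import bisect
import itertools
import random
from collections import Counter

# Hypothetical in-memory copy of the `test` table: (id, percent).
ROWS = [(1, 50), (2, 35), (3, 15)]

def pick(rows, r):
    """Return the id of the first row whose running sum S satisfies S >= r."""
    running = list(itertools.accumulate(w for _, w in rows))  # [50, 85, 100]
    # bisect_left returns the first index whose running sum is >= r.
    return rows[bisect.bisect_left(running, r)][0]

def sample(rows, rng):
    """Draw ONE random number, scale it to [0, total), and pick a row."""
    total = sum(w for _, w in rows)
    return pick(rows, rng.random() * total)

rng = random.Random(42)  # fixed seed so the run is repeatable
counts = Counter(sample(ROWS, rng) for _ in range(10_000))
print(dict(counts))
```

Because the random number is scaled by the total weight, the same sketch works for weights that do not add up to 100.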
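Sort by random() ^ (1.0 / p), where p is the row's weight (here, the percent column), in descending order and take the first row. This is from the algorithm described by Efraimidis and Spirakis.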
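To see why this key works, here is a small Python simulation; it is a sketch under assumptions, not the answerer's code. The `ROWS` list, the `weighted_pick` helper, and the seed are hypothetical. Each row gets the key u ** (1 / weight) from its own uniform draw u, and the row with the largest key is selected, which gives row i a winning probability of weight_i divided by the total weight.

```python
import random
from collections import Counter

# Hypothetical in-memory copy of the `test` table: (id, percent).
ROWS = [(1, 50), (2, 35), (3, 15)]

def weighted_pick(rows, rng):
    """Pick one id using keys u ** (1 / weight); the largest key wins."""
    # max() evaluates the key once per row, so each row gets its own draw.
    return max(rows, key=lambda row: rng.random() ** (1.0 / row[1]))[0]

rng = random.Random(7)  # fixed seed so the run is repeatable
n = 20_000
counts = Counter(weighted_pick(ROWS, rng) for _ in range(n))
shares = {row_id: counts[row_id] / n for row_id, _ in ROWS}
print(shares)
```

In SQL terms, taking the largest key corresponds to a descending sort before LIMIT 1; an ascending sort would return the row with the smallest key instead.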
Your proposed query appears to work; see this SQLFiddle demo. It creates the wrong distribution though; see below.
To prevent PostgreSQL from optimising the subquery, I've wrapped it in a VOLATILE SQL function. PostgreSQL has no way to know that you intend the subquery to run once for every row of the outer query, so if you don't force it to be volatile it'll just execute it once. Another possibility, though one that the query planner might optimize out in future, is to make it appear to be a correlated subquery with a hack that uses an always-true WHERE clause, like this: http://sqlfiddle.com/#!12/3039b/9
At a guess (before you updated to explain why it didn't work), either your testing methodology was at fault, or you're using this as a subquery in an outer query where PostgreSQL notices it isn't a correlated subquery and executes it just once, like in this example.
UPDATE: The distribution produced isn't what you're expecting. The issue here is that you're skewing the distribution by taking multiple samples of random(); you need a single sample.
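The skew from multiplying each weight by its own random() draw can be reproduced with a short Python simulation. This is an illustrative sketch: the `ROWS` list, the `skewed_pick` helper, and the seed are assumptions, and the function only mimics the ORDER BY p * random() DESC LIMIT 1 idea from the question.

```python
import random
from collections import Counter

# Hypothetical in-memory copy of the `test` table: (id, percent).
ROWS = [(1, 50), (2, 35), (3, 15)]

def skewed_pick(rows, rng):
    """One independent draw per row, largest weight * random() wins."""
    return max(rows, key=lambda row: row[1] * rng.random())[0]

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 20_000
counts = Counter(skewed_pick(ROWS, rng) for _ in range(n))
shares = {row_id: counts[row_id] / n for row_id, _ in ROWS}
print(shares)
```

The heaviest row wins well over half of the time and the lightest row only a few percent of the time, in line with the {1=6293, 2=3302, 3=405} counts reported in the question.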
This query produces the correct distribution (SQLFiddle):
WITH random_weight(rw) AS (SELECT random() * (SELECT sum(percent) FROM test))
SELECT id
FROM (
  SELECT
    id,
    -- running sums: upper bound includes this row, lower bound excludes it
    sum(percent) OVER (ORDER BY id),
    coalesce(sum(prev_percent) OVER (ORDER BY id), 0)
  FROM (
    SELECT
      id,
      percent,
      -- ORDER BY id makes "previous row" well defined
      lag(percent) OVER (ORDER BY id) AS prev_percent
    FROM test
  ) x
) weighted_ids(id, weight_upper, weight_lower)
CROSS JOIN random_weight
WHERE rw BETWEEN weight_lower AND weight_upper;
Performance is, needless to say, horrible. It's using two nested sets of windows. What I'm doing is:
- Creating (id, percent, previous_percent) then using that to create two running sums of weights that are used as range brackets; then
- Taking a random value, scaling it to the range of weights, and then picking a value that has weights within the target bracket
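The two steps above can be sketched in Python to make the brackets visible. This is a hypothetical illustration, not the answer's query: `ROWS`, `brackets`, and `pick` are assumed names, and the sketch deliberately uses half-open brackets [lower, upper), whereas SQL's BETWEEN includes both ends.

```python
import itertools

# Hypothetical in-memory copy of the `test` table: (id, percent).
ROWS = [(1, 50), (2, 35), (3, 15)]

def brackets(rows):
    """Return (id, weight_lower, weight_upper) for each row, in id order."""
    ordered = sorted(rows)
    uppers = list(itertools.accumulate(w for _, w in ordered))
    lowers = [0] + uppers[:-1]  # each lower bound is the previous upper bound
    return [(row_id, lo, hi)
            for (row_id, _), lo, hi in zip(ordered, lowers, uppers)]

def pick(rows, rw):
    """Return the id whose bracket contains rw, or None if rw is out of range."""
    for row_id, lo, hi in brackets(rows):
        if lo <= rw < hi:  # half-open, so a boundary value matches one row only
            return row_id
    return None

print(brackets(ROWS))
print(pick(ROWS, 60.0))
```

With inclusive bounds on both sides, a value that lands exactly on a boundary would match two adjacent rows; the half-open comparison is one way to avoid that.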
Here is something for you to play with:
select t1.id as id1
, case when t2.id is null then 0 else t2.id end as id2
, t1.percent as percent1
, case when t2.percent is null then 0 else t2.percent end as percent2
from "Test1" t1
left outer join "Test1" t2 on t1.id = t2.id + 1
where random() * 100 between t1.percent and
case when t2.percent is null then 0 else t2.percent end;
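Essentially, this performs a left outer join so that each row has two columns to which a BETWEEN clause can be applied.
Note that it will only work if the table is set up in the right way: as written, percent would need to hold running totals in id order, and since BETWEEN expects the lower bound first, the two bounds would need to be swapped.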
Based on Branko Dimitrijevic's answer, I wrote this query, which may or may not be faster: it computes the sum total of percent using tiered windowing functions (not unlike a ROLLUP).
WITH random AS (SELECT random() AS random)
SELECT id FROM (
SELECT id, percent,
SUM(percent) OVER (ORDER BY id) AS rank,
SUM(percent) OVER () * random AS roll
FROM test CROSS JOIN random
) t WHERE roll <= rank LIMIT 1
If the ordering isn't important, SUM(percent) OVER (ROWS UNBOUNDED PRECEDING) AS rank may be preferable because it avoids having to sort the data first.
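I also tried Mechanic Wei's answer (as described in this paper, apparently), which seems very promising in terms of performance, but after some testing, the distribution appears to be off. One likely cause: the query below sorts ascending, so it returns the row with the smallest key, whereas the Efraimidis-Spirakis scheme keeps the row with the largest key (a descending sort).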
SELECT id
FROM test
ORDER BY random() ^ (1.0/percent)
LIMIT 1
Source: https://stackoverflow.com/questions/13040246/select-random-row-from-a-postgresql-table-with-weighted-row-probabilities