quick random row selection in Postgres

NPE

You might want to experiment with OFFSET, as in

SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;

Here N is the number of rows in mytable. You may need to run a SELECT COUNT(*) first to figure out the value of N.
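For example, the two steps could look like this (a small sketch; mytable and myid are just the placeholder names used above, and 1000000 stands in for whatever the count returns):

-- Step 1: find N.
SELECT count(*) FROM mytable;

-- Step 2: plug that value in as N.
SELECT myid FROM mytable OFFSET floor(random() * 1000000) LIMIT 1;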

Update (by Antony Hatchkins)

You must use floor here:

SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;

Consider a table of 2 rows; random()*N generates 0 <= x < 2, so, for example, SELECT myid FROM mytable OFFSET 1.7 LIMIT 1; returns 0 rows because the offset is implicitly rounded to the nearest integer (here 2).

alfonx

PostgreSQL 9.5 introduced a new approach for much faster sample selection: TABLESAMPLE

The syntax is

SELECT * FROM my_table TABLESAMPLE BERNOULLI(percentage);
SELECT * FROM my_table TABLESAMPLE SYSTEM(percentage);

This is not the optimal solution if you want only one row, because you need to know the table's COUNT to calculate the exact percentage.

To avoid a slow COUNT and use fast TABLESAMPLE for tables from 1 row to billions of rows, you can do:

 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.000001) LIMIT 1;
 -- if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.00001) LIMIT 1;
 -- if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.0001) LIMIT 1;
 -- if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.001) LIMIT 1;
 ...

This might not look very elegant, but it is probably faster than any of the other approaches.

To decide whether you want to use BERNOULLI or SYSTEM, read about the difference at http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/
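If you would rather not run those retries by hand, here is a minimal PL/pgSQL sketch of the same escalation loop. The function name random_row_via_tablesample is made up for illustration, and my_table stands for your table:

CREATE OR REPLACE FUNCTION random_row_via_tablesample()
RETURNS my_table
LANGUAGE plpgsql AS
$$
DECLARE
  pct numeric := 0.000001;   -- starting sample percentage
  result my_table;
  n bigint;
BEGIN
  LOOP
    -- Dynamic SQL so the percentage can come from a variable.
    EXECUTE format('SELECT * FROM my_table TABLESAMPLE SYSTEM(%s) LIMIT 1', pct)
      INTO result;
    GET DIAGNOSTICS n = ROW_COUNT;
    EXIT WHEN n > 0;          -- got a row
    pct := pct * 10;          -- widen the sample and try again
    EXIT WHEN pct > 100;      -- give up (empty table); returns NULL
  END LOOP;
  RETURN result;
END;
$$;

-- Usage:
-- SELECT * FROM random_row_via_tablesample();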

John Coryat

I tried this with a subquery and it worked fine. OFFSET, at least in PostgreSQL v8.4.4, works fine.

select * from mytable offset random() * (select count(*) from mytable) limit 1 ;

You need to use floor:

SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;
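Putting the two together, a one-query sketch (again using the placeholder name mytable) would be:

SELECT * FROM mytable OFFSET floor(random() * (SELECT count(*) FROM mytable)) LIMIT 1;

Note that the scalar subquery still counts the whole table, so this is convenient rather than fast.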
Kuberchaun

Check this link out for some different options. http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/

Update (by A. Hatchkins)

The summary of the (very) long article is as follows.

The author lists four approaches:

1) ORDER BY random() LIMIT 1; -- slow

2) ORDER BY id where id>=random()*N LIMIT 1 -- non-uniform if there are gaps

3) random column -- needs to be updated every now and then

4) custom random aggregate -- cunning method, could be slow: random() needs to be generated N times

and suggests improving method #2 by using

5) ORDER BY id where id=random()*N LIMIT 1 with subsequent requeries if the result is empty.
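A minimal PL/pgSQL sketch of that last approach, assuming a non-empty table mytable with an integer primary key id (the function name and the table/column names are placeholders, not code from the article):

CREATE OR REPLACE FUNCTION random_row_by_id()
RETURNS mytable
LANGUAGE plpgsql AS
$$
DECLARE
  max_id bigint;
  result mytable;
BEGIN
  SELECT max(id) INTO max_id FROM mytable;
  LOOP
    SELECT * INTO result
    FROM mytable
    WHERE id = 1 + floor(random() * max_id)::bigint;  -- random id in [1, max_id]
    EXIT WHEN FOUND;   -- hit an existing id; on a gap, just requery
  END LOOP;
  RETURN result;
END;
$$;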

I've come up with a very fast solution that doesn't need TABLESAMPLE. It's much faster than OFFSET random()*N LIMIT 1, and it doesn't even require the table count.

The idea is to create an expression index with random but predictable data, for example md5(primary key).

Here is a test with 1M rows of sample data:

create table randtest (id serial primary key, data int not null);

insert into randtest (data) select (random()*1000000)::int from generate_series(1,1000000);

create index randtest_md5_id_idx on randtest (md5(id::text));

explain analyze
select * from randtest where md5(id::text)>md5(random()::text)
order by md5(id::text) limit 1;

Result:

 Limit  (cost=0.42..0.68 rows=1 width=8) (actual time=6.219..6.220 rows=1 loops=1)
   ->  Index Scan using randtest_md5_id_idx on randtest  (cost=0.42..84040.42 rows=333333 width=8) (actual time=6.217..6.217 rows=1 loops=1)
         Filter: (md5((id)::text) > md5((random())::text))
         Rows Removed by Filter: 1831
 Total runtime: 6.245 ms

This query can sometimes (with about 1/number_of_rows probability) return 0 rows, so it needs to be checked and rerun. Also, the probabilities aren't exactly equal: some rows are more probable than others.
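One way to handle the empty-result case is to wrap the query in a retry loop. A hedged sketch (the wrapper function random_row_md5 is made up here; the query inside is the one from the test above):

CREATE OR REPLACE FUNCTION random_row_md5()
RETURNS randtest
LANGUAGE plpgsql AS
$$
DECLARE
  result randtest;
BEGIN
  LOOP
    SELECT * INTO result
    FROM randtest
    WHERE md5(id::text) > md5(random()::text)
    ORDER BY md5(id::text)
    LIMIT 1;
    EXIT WHEN FOUND;   -- on the rare empty result, just run the query again
  END LOOP;
  RETURN result;
END;
$$;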

For comparison:

explain analyze SELECT id FROM randtest OFFSET random()*1000000 LIMIT 1;

Results vary widely, but can be pretty bad:

 Limit  (cost=1442.50..1442.51 rows=1 width=4) (actual time=179.183..179.184 rows=1 loops=1)
   ->  Seq Scan on randtest  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.016..134.835 rows=915702 loops=1)
 Total runtime: 179.211 ms
(3 rows)

daamien

The easiest and fastest way to fetch a random row is to use the tsm_system_rows extension:

CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

Then you can select the exact number of rows you want:

SELECT myid FROM mytable TABLESAMPLE SYSTEM_ROWS(1);

This is available with PostgreSQL 9.5 and later.

See: https://www.postgresql.org/docs/current/static/tsm-system-rows.html
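Likewise, asking for several rows is just a matter of the argument (a small usage sketch with the placeholder table mytable); keep in mind that SYSTEM_ROWS does block-level sampling, so the result is not perfectly uniform:

SELECT * FROM mytable TABLESAMPLE SYSTEM_ROWS(10);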
