问题
Following on from this answer I want to know what the best way to use PostgreSQL's built-in full text search is if I want to sort by rank, and limit to only matching queries.
Let's assume a very simple table.
CREATE TABLE pictures (
id SERIAL PRIMARY KEY,
title varchar(300),
...
)
or whatever. Now I want to search the title
field. First I create an index:
CREATE INDEX pictures_title ON pictures
USING gin(to_tsvector('english', title));
Now I want to search for 'small dog'
. This works:
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), 'small dog'
) AS score
FROM pictures
ORDER BY score DESC
But what I really want is this:
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), to_tsquery('small dog')
) AS score
FROM pictures
WHERE to_tsvector('english', pictures.title) @@ to_tsquery('small dog')
ORDER BY score DESC
Or alternatively this (which doesn't work - can't use score
in the WHERE
clause):
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), to_tsquery('small dog')
) AS score
FROM pictures WHERE score > 0
ORDER BY score DESC
What's the best way to do this? My questions are many-fold:
- If I use the version with repeated
to_tsvector(...)
will it call that twice, or is it smart enough to cache the results somehow? - Is there a way to do it without repeating the
to_ts...
function calls? - Is there a way to use
score
in theWHERE
clause at all? - If there is, would it be better to filter by
score > 0
or use the@@
thing?
回答1:
The use of the @@
operator will utilize the full text GIN index, while the test for score > 0
would not.
I created a table as in the Question, but added a column named title_tsv
:
CREATE TABLE test_pictures (
id BIGSERIAL,
title text,
title_tsv tsvector
);
CREATE INDEX ix_pictures_title_tsv ON test_pictures
USING gin(title_tsv);
I populated the table with some test data:
INSERT INTO test_pictures(title, title_tsv)
SELECT T.data, to_tsvector(T.data)
FROM some_table T;
Then I ran the previously accepted answer with explain analyze
:
EXPLAIN ANALYZE
SELECT score, id, title
FROM (
SELECT ts_rank_cd(P.title_tsv, to_tsquery('address & shipping')) AS score
,P.id
,P.title
FROM test_pictures as P
) S
WHERE score > 0
ORDER BY score DESC;
And got the following. Note the execution time of 5,015 ms
QUERY PLAN |
----------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge (cost=274895.48..323298.03 rows=414850 width=60) (actual time=5010.844..5011.330 rows=1477 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Sort (cost=273895.46..274414.02 rows=207425 width=60) (actual time=4994.539..4994.555 rows=492 loops=3) |
Sort Key: (ts_rank_cd(p.title_tsv, to_tsquery('address & shipping'::text))) DESC |
Sort Method: quicksort Memory: 131kB |
-> Parallel Seq Scan on test_pictures p (cost=0.00..247776.02 rows=207425 width=60) (actual time=17.672..4993.997 rows=492 loops=3) |
Filter: (ts_rank_cd(title_tsv, to_tsquery('address & shipping'::text)) > '0'::double precision) |
Rows Removed by Filter: 497296 |
Planning time: 0.159 ms |
Execution time: 5015.664 ms |
Now compare that with the @@
operator:
EXPLAIN ANALYZE
SELECT ts_rank_cd(to_tsvector(P.title), to_tsquery('address & shipping')) AS score
,P.id
,P.title
FROM test_pictures as P
WHERE P.title_tsv @@ to_tsquery('address & shipping')
ORDER BY score DESC;
And the results coming in with an execution time of about 29 ms:
QUERY PLAN |
-------------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge (cost=13884.42..14288.35 rows=3462 width=60) (actual time=26.472..26.942 rows=1477 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Sort (cost=12884.40..12888.73 rows=1731 width=60) (actual time=17.507..17.524 rows=492 loops=3) |
Sort Key: (ts_rank_cd(to_tsvector(title), to_tsquery('address & shipping'::text))) DESC |
Sort Method: quicksort Memory: 171kB |
-> Parallel Bitmap Heap Scan on test_pictures p (cost=72.45..12791.29 rows=1731 width=60) (actual time=1.781..17.268 rows=492 loops=3) |
Recheck Cond: (title_tsv @@ to_tsquery('address & shipping'::text)) |
Heap Blocks: exact=625 |
-> Bitmap Index Scan on ix_pictures_title_tsv (cost=0.00..71.41 rows=4155 width=0) (actual time=3.765..3.765 rows=1477 loops=1) |
Index Cond: (title_tsv @@ to_tsquery('address & shipping'::text)) |
Planning time: 0.214 ms |
Execution time: 28.995 ms |
As you can see in the execution plan, the index ix_pictures_title_tsv
was used in the second query, but not in the first one, making the query with the @@
operator a whopping 172 times faster!
回答2:
select *
from (
SELECT
pictures.id,
ts_rank_cd(to_tsvector('english', pictures.title),
to_tsquery('small dog')) AS score
FROM pictures
) s
WHERE score > 0
ORDER BY score DESC
回答3:
If I use the version with repeated to_tsvector(...) will it call that twice, or is it smart enough to cache the results somehow?
The best way to notice these things is to do a simple explain, although those can be hard to read.
Long story short, yes, PostgreSQL is smart enough to reuse computed results.
Is there a way to do it without repeating the to_ts... function calls?
What I usually do is add a tsv
column which is the text search vector. If you make this auto update using triggers it immediately gives you the vector easily accessible but it also allows you to selectively update the search index by making the trigger selective.
Is there a way to use score in the WHERE clause at all?
Yes, but not with that name. Alternatively you could create a sub-query, but I would personally just repeat it.
If there is, would it be better to filter by score > 0 or use the @@ thing?
The simplest version I can think of is this:
SELECT *
FROM pictures
WHERE 'small dog' @@ text_search_vector
The text_search_vector
could obviously be replaced with something like to_tsvector('english', pictures.title)
来源:https://stackoverflow.com/questions/12933805/best-way-to-use-postgresql-full-text-search-ranking