PostgreSQL 9.x: Index to optimize `xpath_exists` (XMLEXISTS) queries

Question

We have queries of the form

select sum(acol)
where xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol)

What index can be built to speed up the WHERE clause?

A btree index created using

create index idx_01 using btree(xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol))

does not seem to be used at all.

EDIT

With enable_seqscan set to off, the query using xpath_exists is much faster (by an order of magnitude), and the plan clearly shows the corresponding index being used (the btree index built with xpath_exists).

Any clue why PostgreSQL would not use the index and would instead attempt a much slower sequential scan?

Since I do not want to disable sequential scans globally, I am back to square one and would welcome suggestions.

EDIT 2 - Explain plans

See below - Cost of first plan (seqscan off) is slightly higher but processing time much faster

b2box=# set enable_seqscan=off;
SET
b2box=# explain analyze
Select count(*) 
from B2HEAD.item
where cluster = 'B2BOX' and (  ( xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content) )  )  offset 0 limit 1;
                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=22766.63..22766.64 rows=1 width=0) (actual time=606.042..606.042 rows=1 loops=1)
   ->  Aggregate  (cost=22766.63..22766.64 rows=1 width=0) (actual time=606.039..606.039 rows=1 loops=1)
         ->  Bitmap Heap Scan on item  (cost=1058.65..22701.38 rows=26102 width=0) (actual time=3.290..603.823 rows=4085 loops=1)
               Filter: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) AND ((cluster)::text = 'B2BOX'::text))
               ->  Bitmap Index Scan on item_counter_01  (cost=0.00..1052.13 rows=56515 width=0) (actual time=2.283..2.283 rows=4085 loops=1)
                     Index Cond: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) = true)
 Total runtime: 606.136 ms
(7 rows)

b2box=# set enable_seqscan=on;
SET
b2box=# explain analyze
Select count(*) 
from B2HEAD.item
where cluster = 'B2BOX' and (  ( xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content) )  )  offset 0 limit 1;
                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=22555.71..22555.72 rows=1 width=0) (actual time=10864.163..10864.163 rows=1 loops=1)
   ->  Aggregate  (cost=22555.71..22555.72 rows=1 width=0) (actual time=10864.160..10864.160 rows=1 loops=1)
         ->  Seq Scan on item  (cost=0.00..22490.45 rows=26102 width=0) (actual time=33.574..10861.672 rows=4085 loops=1)
               Filter: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) AND ((cluster)::text = 'B2BOX'::text))
               Rows Removed by Filter: 108945
 Total runtime: 10864.242 ms
(6 rows)

Answer 1:

Planner cost parameters

"Cost of first plan (seqscan off) is slightly higher but processing time much faster"

This tells me that your random_page_cost and seq_page_cost are probably wrong. You're likely on storage with fast random I/O - either because most of the database is cached in RAM or because you're using SSD, SAN with cache, or other storage where random I/O is inherently fast.

Try:

SET random_page_cost = 1;
SET seq_page_cost = 1.1;

to greatly reduce the difference between the two cost parameters, then re-run the query. If that does the job, consider changing those parameters in postgresql.conf.
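
One way to try this without affecting other sessions is to scope the settings to a single transaction. This is a minimal sketch that reuses the query from the question and the values suggested above; the database name b2box is taken from the psql prompt, and the persistent values in the last statement are only an assumed starting point, not a measured recommendation.

```sql
-- Experiment inside a transaction: SET LOCAL reverts at COMMIT/ROLLBACK.
BEGIN;
SET LOCAL random_page_cost = 1;
SET LOCAL seq_page_cost = 1.1;

EXPLAIN ANALYZE
SELECT count(*)
FROM B2HEAD.item
WHERE cluster = 'B2BOX'
  AND xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content);
ROLLBACK;

-- If the index plan is now chosen, the setting can be persisted per database
-- instead of editing postgresql.conf (applies to new sessions only).
-- Assumed value: adjust after measuring on your own hardware.
ALTER DATABASE b2box SET random_page_cost = 1.1;
```

The PostgreSQL documentation describes a random_page_cost lower than seq_page_cost as not physically sensible, so for a permanent setting it is probably safer to keep seq_page_cost at 1 and bring random_page_cost down towards it (for example 1.1) than to invert the two.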

Your row-count estimates are reasonable, so it doesn't look like a planner mis-estimation problem or a problem with bad table statistics.
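
If you want to double-check the statistics anyway, something along these lines should work. It is only a sketch: it assumes the unquoted schema name B2HEAD folds to b2head, and that item_counter_01 (the index named in the plan) is the expression index, whose gathered statistics are expected to appear in pg_stats under the index name.

```sql
-- Refresh statistics for the table and its expression indexes.
ANALYZE B2HEAD.item;

-- Inspect what the planner knows about the filtered columns/expressions.
SELECT tablename, attname, null_frac, n_distinct,
       most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'b2head'
  AND tablename IN ('item', 'item_counter_01');
```

Comparing most_common_freqs with the actual fraction of matching rows shows how far the estimate (rows=26102) is from reality (rows=4085).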

Incorrect query

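Your query also has a problem. OFFSET 0 LIMIT 1 without an ORDER BY will produce unpredictable results unless the query is guaranteed to return exactly one row, in which case the OFFSET ... LIMIT ... clauses are unnecessary and can be removed entirely. (An ungrouped count(*), as in the plans above, always returns exactly one row, so there the clauses are merely redundant.)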

You're usually much better off phrasing such queries as SELECT max(...) or SELECT min(...) where possible; PostgreSQL can often use an index to pluck off the desired value directly, without an expensive table scan or an index scan followed by a sort.
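
As a rough illustration of both points: the first statement is the query from the question with the redundant clauses dropped; the second uses a hypothetical column created_at (not part of the original schema as far as the question shows) to illustrate the max() form.

```sql
-- An aggregate without GROUP BY returns exactly one row,
-- so OFFSET 0 LIMIT 1 adds nothing here.
SELECT count(*)
FROM B2HEAD.item
WHERE cluster = 'B2BOX'
  AND xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content);

-- Hypothetical: instead of picking an arbitrary row with LIMIT 1,
-- ask for a well-defined extreme value. created_at is an assumed column.
SELECT max(created_at)
FROM B2HEAD.item
WHERE cluster = 'B2BOX';
```

With a suitable btree index on the column being aggregated, the planner can typically answer the second form from one end of the index.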

Tips

BTW, for future questions the PostgreSQL wiki has some good information in the performance category and a guide to asking Slow query questions.

Source: https://stackoverflow.com/questions/16077982/postgresql-9-x-index-to-optimize-xpath-exists-xmlexists-queries

Tags

postgresql

postgresql-9.1

postgresql-9.2