Question
I have the following table:
CREATE TABLE xml_files (
"parsing_status" Character Varying( 150 ),
"purchaseNumber" Character Varying( 2044 ),
"docPublishDate" Timestamp With Time Zone );
Sample data:
purchaseNumber parsing_status docPublishDate
0373200554017000226 null 2017-07-28 19:00:10.885+03
0373200554017000226 null 2017-07-28 19:08:30.346+03
0373200554017000226 null 2017-07-28 19:24:35.265+03
0373400005317002182 null 2017-07-28 19:45:02.162+03
0348100035117000082 null 2017-07-28 20:08:26.37+03
0373200554017000292 null 2017-07-28 20:10:24.312+03
0373200081217000531 null 2017-07-28 20:13:56.166+03
0373200041517000400 null 2017-07-28 21:23:20.616+03
0373200081217000531 null 2017-07-29 08:18:29.571+03
0373200081217000531 null 2017-07-29 09:34:11.545+03
0373100026117000078 null 2017-07-29 10:37:01.161+03
0573400000117001086 null 2017-07-29 11:25:37.863+03
0573400000117001096 null 2017-07-29 11:30:36.499+03
0373200081217000531 null 2017-07-29 12:14:04.033+03
0573400000117001118 null 2017-07-29 14:50:34+03
0573400000117001118 null 2017-07-29 16:49:12.457+03
0373100026117000080 null 2017-07-29 16:52:02.013+03
0373100026117000080 null 2017-07-29 17:05:40.981+03
0373100026117000080 null 2017-07-29 17:13:29.532+03
0373200554017000226 null 2017-07-29 18:55:47.488+03
The column purchaseNumber has duplicates.
I need to select all the latest unparsed records for processing. I am doing it with the following SQL:
SELECT
"purchaseNumber", "parsing_status", "docPublishDate"
FROM (
select distinct on ("purchaseNumber") x.*
from xml_files x
order by "purchaseNumber", "docPublishDate" desc
) x
where parsing_status is distinct from 'true'
AND parsing_status IS NULL
order by "docPublishDate" LIMIT 100
The problem is that the query takes a very long time on a table with millions of rows. How can I improve its speed? Here is a data sample: https://www.db-fiddle.com/f/vycMHGLYML5K56SN77HLsY/0
Answer 1:
For your query, you want an index on xml_files("purchaseNumber", "docPublishDate" desc):
create index idx_xml_files_2 on xml_files("purchaseNumber", "docPublishDate" desc)
Postgres should use this index for the order by, which facilitates the distinct on.
Although it won't affect performance, I would also suggest simplifying the where clause to:
where parsing_status IS NULL
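To check whether the planner actually picks up the index, one option is to inspect the execution plan. The following is a minimal sketch, assuming the index idx_xml_files_2 above has already been created; it only looks at the inner DISTINCT ON step from the question.

```sql
-- Refresh planner statistics so the new index is costed realistically.
ANALYZE xml_files;

-- Run the inner step and print the actual plan, timings, and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON ("purchaseNumber") x.*
FROM xml_files x
ORDER BY "purchaseNumber", "docPublishDate" DESC;
```

If the plan shows an index scan on idx_xml_files_2 feeding the Unique node instead of a sequential scan followed by a sort, the index is being used. Because the query still has to visit every row, the planner may legitimately prefer the sequential scan on some data, so the timings are worth comparing either way.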
Answer 2:
OK, here are some tips.
1. Improve query
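-- Mixed-case column names must be double-quoted in Postgres.
SELECT t1."purchaseNumber", t1."parsing_status", t1."docPublishDate"
FROM xml_files t1
LEFT JOIN xml_files t2
ON t1."purchaseNumber" = t2."purchaseNumber"
AND t1."docPublishDate" < t2."docPublishDate"
WHERE t1."parsing_status" IS NULL
AND t2."parsing_status" IS NULL
-- No newer row exists for this purchaseNumber, so t1 is the latest one.
AND t2."docPublishDate" IS NULL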
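The same anti-join can also be written with NOT EXISTS, which some readers find easier to follow. This is only a sketch of an equivalent formulation, not something from the original answer; the ORDER BY and LIMIT 100 are carried over from the query in the question.

```sql
-- Keep unparsed rows for which no newer row with the same purchaseNumber exists.
SELECT t1."purchaseNumber", t1."parsing_status", t1."docPublishDate"
FROM xml_files t1
WHERE t1."parsing_status" IS NULL
  AND NOT EXISTS (
        SELECT 1
        FROM xml_files t2
        WHERE t2."purchaseNumber" = t1."purchaseNumber"
          AND t2."docPublishDate" > t1."docPublishDate"
      )
ORDER BY t1."docPublishDate"
LIMIT 100;
```

Unlike DISTINCT ON, both forms return every row that ties for the latest docPublishDate within a purchaseNumber, so the results can differ if exact duplicate timestamps occur.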
2. Improve table
You can also try adding an index, but if the table contains only these three columns, I am not sure how much it will help. Depending on the data distribution (for example, if you know that half of the parsing_status values are null), you can try:
create index idx_xml_files_2 on xml_files("parsing_status", "purchaseNumber")
Source: https://stackoverflow.com/questions/65493027/how-to-improve-speed-of-query