POSTGRES select n equally distributed rows by time over millions of records

[亡魂溺海] Submitted on 2021-02-11 15:17:46

Question


I have a table with columns id, filter1, filter2, time, value, which contains millions of records. I want to fetch n equally distributed rows between two timestamps. If the number of records between the timestamps is less than n, I want to fetch all the records.
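
For reference, a minimal sketch of the table layout implied above; the column types are assumptions, since only the column names are given:

CREATE TABLE table_name (
   id      bigint PRIMARY KEY,
   filter1 text,
   filter2 text,
   time    timestamp,         -- timestamp vs. timestamptz is discussed in Answer 1
   value   double precision
);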

My current query looks like the one below, assuming n = 200:

SELECT s.* FROM (
    SELECT t.time, t.value,
           ROW_NUMBER() OVER(ORDER BY t.time) as rnk,
           COUNT(*) OVER() as total_cnt
    FROM table_name t
    WHERE t.filter1='filter_value' 
    and t.filter2='another_value' 
    and t.time between '2020-04-18' AND '2020-04-19') s
WHERE MOD(s.rnk,(total_cnt/200)) = 0 ;

I have an index on (filter1, filter2, time). Still, this query is extremely slow when there are around 10 million records.

I have also tried TABLESAMPLE, but I couldn't come up with an appropriate percentage condition that is fast enough and also returns all rows when there are fewer than n.
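
For illustration, such an attempt might have looked like the sketch below (the 1 % figure is a made-up placeholder). TABLESAMPLE SYSTEM samples a fixed fraction of the table's pages before the WHERE clause is applied, so it can neither adapt the fraction to the number of qualifying rows nor return all rows when there are fewer than n:

SELECT t.time, t.value
FROM   table_name t TABLESAMPLE SYSTEM (1)  -- ~1 % of the table's pages, picked before WHERE
WHERE  t.filter1 = 'filter_value'
AND    t.filter2 = 'another_value'
AND    t.time >= '2020-04-18'
AND    t.time <  '2020-04-20';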


Answer 1:


If ...

  • ... you have no additional meta information about logical or physical data distribution
  • ... and need the selection to be spread out equally over time

... then your original query is basically as good as it gets. You have the index on (filter1, filter2, time), as Gordon suggested. It helps (a lot) if less than a few percent of the rows pass the filters. We then have to count and number all qualifying rows (the expensive part when many rows qualify) to get a strictly uniform distribution in the sample.

A few minor suggestions:

SELECT s.*
FROM  (
   SELECT t.time, t.value
        , row_number() OVER (ORDER BY t.time) AS rn  -- ①
        , count(*) OVER() AS total_cnt
   FROM   table_name t
   WHERE  t.filter1 = 'filter_value' 
   AND    t.filter2 = 'another_value' 
   AND    t.time >= '2020-04-18'  -- assuming data type timestamp!
   AND    t.time <  '2020-04-20'  -- ②
   ) s
WHERE  mod(s.rn, total_cnt/n) = total_cnt/n/2 + 1;  -- ③

① Use the column alias rn (or whatever) for row_number(); rnk would hint at rank().

② Assuming the column "time" is of data type timestamp, since neither date nor time would make sense here. (The name "time" seems misleading.) So this predicate is most likely wrong:

t.time between '2020-04-18' AND '2020-04-19'

The given date literals are coerced to the timestamps 2020-04-18 00:00 / 2020-04-19 00:00. Since BETWEEN includes both the lower and the upper bound, the filter effectively selects all of 2020-04-18 plus the first instant of 2020-04-19, which hardly ever makes sense. My suggested fix includes all of 2020-04-18 and all of 2020-04-19.
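
A quick check shows the coercion (assuming data type timestamp):

SELECT timestamp '2020-04-19';   -- 2020-04-19 00:00:00

-- so the original predicate is equivalent to:
-- t.time >= timestamp '2020-04-18 00:00:00'
-- AND t.time <= timestamp '2020-04-19 00:00:00'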

If the column "time" is data type timestamptz, then the above basically applies as well. Plus, you add a dependency on the timezone setting of the database session into the mix. Don't! See:

  • Subtract hours from the now() function
  • Generating time series between two dates in PostgreSQL

③ Your original condition MOD(s.rnk, (total_cnt/n)) = 0 picks every total_cnt/n-th row, always skipping the first total_cnt/n - 1 rows, which creates a bias toward later rows. To illustrate:

ooooXooooXooooXooooX

My alternative shifts the selection towards the center, which seems more reasonable:

ooXooooXooooXooooXoo

The integer division total_cnt/n/2 might produce 0, which would turn the condition back into mod(rn, total_cnt/n) = 0, i.e. the original biased pick. Adding 1 (total_cnt/n/2 + 1) prevents that. Plus, it's more in the "center" in any case.
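
To see both conditions side by side, here is a minimal demo with generate_series standing in for the numbered rows; with total_cnt = 20 and n = 4 (so total_cnt/n = 5) it reproduces the two patterns above:

SELECT g.rn
     , mod(g.rn, 5) = 0        AS picked_old  -- rows  5, 10, 15, 20
     , mod(g.rn, 5) = 5/2 + 1  AS picked_new  -- rows  3,  8, 13, 18
FROM   generate_series(1, 20) AS g(rn);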

Finally, it should be mentioned that the result for equal values in time is arbitrary. You might want to define a tiebreaker if that matters ...
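
For instance, a sketch assuming the id column from the question is unique:

SELECT t.time, t.value
     , row_number() OVER (ORDER BY t.time, t.id) AS rn  -- id makes the order deterministic
FROM   table_name t;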

That said, we might be able to use any meta information about data distribution to our advantage. Or if we can relax requirements for strictly uniform distribution in the sample (to what degree?).


Radically faster with only index scans

If we can assume uniform data distribution over time for all (or some) combinations of (filter1, filter2) we can just partition the time interval and get away with n very cheap index (only) scans. (Or if we don't care about uniform data distribution too much, we might do it anyway.) To illustrate:

WITH input (f1    , f2    , lo                    , hi                    , n) AS (
   VALUES  ('val2', 'val2', timestamp '2020-04-18', timestamp '2020-04-20', 200)
   )
SELECT g.lo, s.*
FROM   (SELECT *, (hi - lo) / n AS span FROM input) i
CROSS  JOIN generate_series(lo, hi - span, span) g(lo)
LEFT   JOIN LATERAL (   
   SELECT t.time, t.value
   FROM   table_name t
   WHERE  t.filter1 = i.f1
   AND    t.filter2 = i.f2
   AND    t.time >= g.lo
   AND    t.time <  g.lo + span
   ORDER  BY time
   LIMIT  1
   ) s ON true;

This is just a proof of concept, which can be tweaked in a hundred and one ways. There is a lot going on in this query, and there is not enough information about the case to streamline it.

The main objective is to avoid processing all rows, and only fetch the ones to return.

The query starts at the lower bound, producing a selection pattern like:

XooooXooooXooooXoooo

The LEFT JOIN keeps empty time slices in the result, which indicate non-uniform data distribution.

Any kind of meta information on table design, data distribution, write patterns etc. might be used to optimize further. Indexing might be optimized: index-only scans, partial indexes, ...
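
For example, a partial covering index could turn each LATERAL subquery into an index-only scan; a sketch, assuming PostgreSQL 11 or later for the INCLUDE clause, that the filter values from the demo above are queried often enough to warrant it, and a made-up index name:

CREATE INDEX table_name_val2_time_idx ON table_name (time) INCLUDE (value)
WHERE  filter1 = 'val2'
AND    filter2 = 'val2';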




Answer 2:


For this query:

SELECT s.*
FROM (SELECT t.time, t.value,
             ROW_NUMBER() OVER (ORDER BY t.time) as rnk,
             COUNT(*) OVER () as total_cnt
      FROM table_name t
      WHERE t.filter1 = 'filter_value' AND
            t.filter2 = 'another_value' AND
            t.time between '2020-04-18' AND '2020-04-19'
     ) s
WHERE MOD(s.rnk, (total_cnt / 200)) = 0 ;

You want an index on (filter1, filter2, time). This should help performance.
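
That is (the index name is chosen for illustration):

CREATE INDEX table_name_filter1_filter2_time_idx
ON table_name (filter1, filter2, time);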



Source: https://stackoverflow.com/questions/61385262/postgres-select-n-equally-distributed-rows-by-time-over-millions-of-records
