Ordered count of consecutive repeats / duplicates

后端未结

关注

 1  1244

I highly doubt I\'m doing this in the most efficient manner, which is why I tagged plpgsql on here. I need to run this on 2 billion rows for a

相关标签:

1条回答

情深已故

2020-12-12 04:16
Test case

First, a more useful way to present your data - or even better, in an sqlfiddle, ready to play with:
```
CREATE TEMP TABLE data(
   system_measured int
 , time_of_measurement int
 , measurement int
);

INSERT INTO data VALUES
 (1, 1, 5)
,(1, 2, 150)
,(1, 3, 5)
,(1, 4, 5)
,(2, 1, 5)
,(2, 2, 5)
,(2, 3, 5)
,(2, 4, 5)
,(2, 5, 150)
,(2, 6, 5)
,(2, 7, 5)
,(2, 8, 5);
```
Simplified query

Since it remains unclear, I am assuming only the above as given.
Next, I simplified your query to arrive at:
```
WITH x AS (
   SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                               ORDER BY time_of_measurement) = measurement
                  THEN 0 ELSE 1 END AS step
   FROM   data
   )
   , y AS (
   SELECT *, sum(step) OVER(PARTITION BY system_measured
                            ORDER BY time_of_measurement) AS grp
   FROM   x
   )
SELECT * ,row_number() OVER (PARTITION BY system_measured, grp
                             ORDER BY time_of_measurement) - 1 AS repeat_ct
FROM   y
ORDER  BY system_measured, time_of_measurement;
```
Now, while it is all nice and shiny to use pure SQL, this will be much faster with a plpgsql function, because it can do it in a single table scan where this query needs at least three scans.

Faster with plpgsql function:
```
CREATE OR REPLACE FUNCTION x.f_repeat_ct()
  RETURNS TABLE (
    system_measured int
  , time_of_measurement int
  , measurement int, repeat_ct int
  )  LANGUAGE plpgsql AS
$func$
DECLARE
   r    data;     -- table name serves as record type
   r0   data;
BEGIN

-- SET LOCAL work_mem = '1000 MB';  -- uncomment an adapt if needed, see below!

repeat_ct := 0;   -- init

FOR r IN
   SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement
LOOP
   IF  r.system_measured = r0.system_measured
       AND r.measurement = r0.measurement THEN
      repeat_ct := repeat_ct + 1;   -- start new array
   ELSE
      repeat_ct := 0;               -- start new count
   END IF;

   RETURN QUERY SELECT r.*, repeat_ct;

   r0 := r;                         -- remember last row
END LOOP;

END
$func$;
```
Call:
```
SELECT * FROM x.f_repeat_ct();
```
Be sure to table-qualify your column names at all times in this kind of plpgsql function, because we use the same names as output parameters which would take precedence if not qualified.

Billions of rows

If you have billions of rows, you may want to split this operation up. I quote the manual here:

Note: The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, as discussed above. That means that if a PL/pgSQL function produces a very large result set, performance might be poor: data will be written to disk to avoid memory exhaustion, but the function itself will not return until the entire result set has been generated. A future version of PL/pgSQL might allow users to define set-returning functions that do not have this limitation. Currently, the point at which data begins being written to disk is controlled by the work_mem configuration variable. Administrators who have sufficient memory to store larger result sets in memory should consider increasing this parameter.

Consider computing rows for one system at a time or set a high enough value for work_mem to cope with the load. Follow the link provided in the quote on more about work_mem.

One way would be to set a very high value for work_mem with SET LOCAL in your function, which is only effective for for the current transaction. I added a commented line in the function. Do not set it very high globally, as this could nuke your server. Read the manual.
0 讨论(0)
发布评论:

提交评论
- 加载中...

Ordered count of consecutive repeats / duplicates

Test case

Simplified query

Faster with plpgsql function:

Billions of rows