Ordered count of consecutive repeats / duplicates

后端 未结 1 1244
旧巷少年郎
旧巷少年郎 2020-12-12 03:38

I highly doubt I\'m doing this in the most efficient manner, which is why I tagged plpgsql on here. I need to run this on 2 billion rows for a

相关标签:
1条回答
  • 2020-12-12 04:16

    Test case

    First, a more useful way to present your data - or even better, in an sqlfiddle, ready to play with:

    CREATE TEMP TABLE data(
       system_measured int
     , time_of_measurement int
     , measurement int
    );
    
    INSERT INTO data VALUES
     (1, 1, 5)
    ,(1, 2, 150)
    ,(1, 3, 5)
    ,(1, 4, 5)
    ,(2, 1, 5)
    ,(2, 2, 5)
    ,(2, 3, 5)
    ,(2, 4, 5)
    ,(2, 5, 150)
    ,(2, 6, 5)
    ,(2, 7, 5)
    ,(2, 8, 5);
    

    Simplified query

    Since it remains unclear, I am assuming only the above as given.
    Next, I simplified your query to arrive at:

    WITH x AS (
       SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                                   ORDER BY time_of_measurement) = measurement
                      THEN 0 ELSE 1 END AS step
       FROM   data
       )
       , y AS (
       SELECT *, sum(step) OVER(PARTITION BY system_measured
                                ORDER BY time_of_measurement) AS grp
       FROM   x
       )
    SELECT * ,row_number() OVER (PARTITION BY system_measured, grp
                                 ORDER BY time_of_measurement) - 1 AS repeat_ct
    FROM   y
    ORDER  BY system_measured, time_of_measurement;
    

    Now, while it is all nice and shiny to use pure SQL, this will be much faster with a plpgsql function, because it can do it in a single table scan where this query needs at least three scans.

    Faster with plpgsql function:

    CREATE OR REPLACE FUNCTION x.f_repeat_ct()
      RETURNS TABLE (
        system_measured int
      , time_of_measurement int
      , measurement int, repeat_ct int
      )  LANGUAGE plpgsql AS
    $func$
    DECLARE
       r    data;     -- table name serves as record type
       r0   data;
    BEGIN
    
    -- SET LOCAL work_mem = '1000 MB';  -- uncomment an adapt if needed, see below!
    
    repeat_ct := 0;   -- init
    
    FOR r IN
       SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement
    LOOP
       IF  r.system_measured = r0.system_measured
           AND r.measurement = r0.measurement THEN
          repeat_ct := repeat_ct + 1;   -- start new array
       ELSE
          repeat_ct := 0;               -- start new count
       END IF;
    
       RETURN QUERY SELECT r.*, repeat_ct;
    
       r0 := r;                         -- remember last row
    END LOOP;
    
    END
    $func$;
    

    Call:

    SELECT * FROM x.f_repeat_ct();
    

    Be sure to table-qualify your column names at all times in this kind of plpgsql function, because we use the same names as output parameters which would take precedence if not qualified.

    Billions of rows

    If you have billions of rows, you may want to split this operation up. I quote the manual here:

    Note: The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, as discussed above. That means that if a PL/pgSQL function produces a very large result set, performance might be poor: data will be written to disk to avoid memory exhaustion, but the function itself will not return until the entire result set has been generated. A future version of PL/pgSQL might allow users to define set-returning functions that do not have this limitation. Currently, the point at which data begins being written to disk is controlled by the work_mem configuration variable. Administrators who have sufficient memory to store larger result sets in memory should consider increasing this parameter.

    Consider computing rows for one system at a time or set a high enough value for work_mem to cope with the load. Follow the link provided in the quote on more about work_mem.

    One way would be to set a very high value for work_mem with SET LOCAL in your function, which is only effective for for the current transaction. I added a commented line in the function. Do not set it very high globally, as this could nuke your server. Read the manual.

    0 讨论(0)
提交回复
热议问题