PostgreSQL - column value changed - select query optimization

老子叫甜甜 提交于 2019-12-10 03:56:22

问题


Say we have a table:

CREATE TABLE p
(
   id serial NOT NULL, 
   val boolean NOT NULL, 
   PRIMARY KEY (id)
);

Populated with some rows:

insert into p (val)
values (true),(false),(false),(true),(true),(true),(false);
ID  VAL
1   1
2   0
3   0
4   1
5   1
6   1
7   0

I want to determine when the value has been changed. So the result of my query should be:

ID  VAL
2   0
4   1
7   0

I have a solution with joins and subqueries:

select min(id) id, val from
(
  select p1.id, p1.val, max(p2.id) last_prev
  from p p1
  join p p2
    on p2.id < p1.id and p2.val != p1.val
  group by p1.id, p1.val
) tmp
group by val, last_prev
order by id;

But it is very inefficient and will work extremely slow for tables with many rows.
I believe there could be more efficient solution using PostgreSQL window functions?

SQL Fiddle


回答1:


This is how I would do it with an analytic:

SELECT id, val
  FROM ( SELECT id, val
           ,LAG(val) OVER (ORDER BY id) AS prev_val
       FROM p ) x
  WHERE val <> COALESCE(prev_val, val)
  ORDER BY id

Update (some explanation):

Analytic functions operate as a post-processing step. The query result is broken into groupings (partition by) and the analytic function is applied within the context of a grouping.

In this case, the query is a selection from p. The analytic function being applied is LAG. Since there is no partition by clause, there is only one grouping: the entire result set. This grouping is ordered by id. LAG returns the value of the previous row in the grouping using the specified order. The result is each row having an additional column (aliased prev_val) which is the val of the preceding row. That is the subquery.

Then we look for rows where the val does not match the val of the previous row (prev_val). The COALESCE handles the special case of the first row which does not have a previous value.

Analytic functions may seem a bit strange at first, but a search on analytic functions finds a lot of examples walking through how they work. For example: http://www.cs.utexas.edu/~cannata/dbms/Analytic%20Functions%20in%20Oracle%208i%20and%209i.htm Just remember that it is a post-processing step. You won't be able to perform filtering, etc on the value of an analytic function unless you subquery it.




回答2:


Window function

Instead of calling COALESCE, you can provide a default from the window function lag() directly. A minor detail in this case since all columns are defined NOT NULL. But this may be essential to distinguish "no previous row" from "NULL in previous row".

SELECT id, val
FROM  (
   SELECT id, val, lag(val, 1, val) OVER (ORDER BY id) <> val AS changed
   FROM   p
   ) sub
WHERE  changed
ORDER  BY id;

Compute the result of the comparison immediately, since the previous value is not of interest per se, only a possible change. Shorter and may be a tiny bit faster.

If you consider the first row to be "changed" (unlike your demo output suggests), you need to observe NULL values - even though your columns are defined NOT NULL. Basic lag() returns NULL in case there is no previous row:

SELECT id, val
FROM  (
   SELECT id, val, lag(val) OVER (ORDER BY id) IS DISTINCT FROM val AS changed
   FROM   p
   ) sub
WHERE  changed
ORDER  BY id;

Or employ the additional parameters of lag() once again:

SELECT id, val
FROM  (
   SELECT id, val, lag(val, 1, NOT val) OVER (ORDER BY id) <> val AS changed
   FROM   p
   ) sub
WHERE  changed
ORDER  BY id;

Recursive CTE

As proof of concept. :) Performance won't keep up with posted alternatives.

WITH RECURSIVE cte AS (
   SELECT id, val
   FROM   p
   WHERE  NOT EXISTS (
      SELECT 1
      FROM   p p0
      WHERE  p0.id < p.id
      )

   UNION ALL
   SELECT p.id, p.val
   FROM   cte
   JOIN   p ON p.id   > cte.id
           AND p.val <> cte.val
   WHERE NOT EXISTS (
     SELECT 1
     FROM   p p0
     WHERE  p0.id   > cte.id
     AND    p0.val <> cte.val
     AND    p0.id   < p.id
     )
  )
SELECT * FROM cte;

With an improvement from @wildplasser.

SQL Fiddle demonstrating all.




回答3:


Can even be done without window functions.

SELECT * FROM p p0
WHERE EXISTS (
        SELECT * FROM p ex
        WHERE ex.id < p0.id
        AND ex.val <> p0.val
        AND NOT EXISTS (
                SELECT * FROM p nx
                WHERE nx.id < p0.id
                AND nx.id > ex.id
                )
        );

UPDATE: Self-joining a non-recursive CTE (could also be a subquery instead of a CTE)

WITH drag AS (
        SELECT id
        , rank() OVER (ORDER BY id) AS rnk
        , val
        FROM p
        )
SELECT d1.*
FROM drag d1
JOIN drag d0 ON d0.rnk = d1.rnk -1
WHERE d1.val <> d0.val
        ;

This nonrecursive CTE approach is surprisingly fast, although it needs an implicit sort.




回答4:


Using 2 row_number() computations: This is also possible to do with usual "islands and gaps" SQL technique (could be useful if you can't use lag() window function for some reason:

with cte1 as (
    select
        *,
        row_number() over(order by id) as rn1,
        row_number() over(partition by val order by id) as rn2
    from p
)
select *, rn1 - rn2 as g
from cte1
order by id

So this query will give you all islands

ID VAL RN1 RN2  G
1   1   1   1   0
2   0   2   1   1
3   0   3   2   1
4   1   4   2   2
5   1   5   3   2
6   1   6   4   2
7   0   7   3   4

You see, how G field could be used to group this islands together:

with cte1 as ( select *, row_number() over(order by id) as rn1, row_number() over(partition by val order by id) as rn2 from p ) select min(id) as id, val from cte1 group by val, rn1 - rn2 order by 1

So you'll get

ID VAL
1   1
2   0
4   1
7   0

The only thing now is you have to remove first record which can be done by getting min(...) over() window function:

with cte1 as (
   ...
), cte2 as (
    select
        min(id) as id,
        val,
        min(min(id)) over() as mid
    from cte1
    group by val, rn1 - rn2
)
select id, val
from cte2
where id <> mid

And results:

ID VAL
2   0
4   1
7   0



回答5:


A simple inner join can do it. SQL Fiddle

select p2.id, p2.val
from
    p p1
    inner join
    p p2 on p2.id = p1.id + 1
where p2.val != p1.val


来源:https://stackoverflow.com/questions/24098970/postgresql-column-value-changed-select-query-optimization

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!