GROUP or DISTINCT after JOIN returns duplicates

深忆病人 2020-12-19 14:01

I have two tables, products and meta. They are in a 1:N relationship: each products row has at least one meta row referencing it via foreign key.

(viz. SQLf

4 answers
  • 2020-12-19 14:38

    While retrieving all or most rows from a table, the fastest way for this type of query typically is to aggregate / disambiguate first and join later:

    SELECT *
    FROM   products p
    JOIN  (
       SELECT DISTINCT ON (product_id) *
       FROM   meta
       ORDER  BY product_id, id DESC
       ) m ON m.product_id = p.id;
    

    The more rows in meta per row in products, the bigger the impact on performance.

    Of course, you'll want to add an ORDER BY clause in the subquery to define which row to pick from each set of duplicates. @Craig and @Clodoaldo already told you about that. I am returning the meta row with the highest id.
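
    To illustrate how the ORDER BY decides which row survives, here is a minimal sketch that keeps the cheapest meta row per product instead of the newest. It assumes meta has a price column (one appears in the second query of this answer); the choice of tiebreaker is only an example.

    ```sql
    -- Sketch: pick the cheapest meta row per product.
    -- Assumption: meta.price exists, as in the second query of this answer.
    SELECT *
    FROM   products p
    JOIN  (
       SELECT DISTINCT ON (product_id) *
       FROM   meta
       ORDER  BY product_id            -- must lead with the DISTINCT ON expression
               , price ASC NULLS LAST  -- lowest price wins; NULL prices sort last
               , id DESC               -- tiebreaker keeps the result deterministic
       ) m ON m.product_id = p.id;
    ```

    The leading ORDER BY expressions have to match the DISTINCT ON expressions; everything after that only ranks rows within each product_id group.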

    SQL Fiddle.

    Details for DISTINCT ON:

    • Select first row in each GROUP BY group?

    Optimize performance

    Still, this is not always the fastest solution. Depending on data distribution there are various other query styles. For this simple case involving another join, this one ran considerably faster in a test with big tables:

    SELECT p.*, sub.meta_id, m.product_id, m.price, m.flag
    FROM  (
       SELECT product_id, max(id) AS meta_id
       FROM   meta
       GROUP  BY 1
       ) sub
    JOIN meta     m ON m.id = sub.meta_id
    JOIN products p ON p.id = sub.product_id;
    

    If you did not use the non-descriptive id as a column name, we would not run into naming collisions and could simply write SELECT p.*, m.*. (I never use id as a column name.)
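
    As a rough sketch of what descriptive names buy you: assume, hypothetically, that the key columns were renamed to products.product_id and meta.meta_id (these names are not in the original schema). The join can then use USING, and SELECT * returns product_id only once:

    ```sql
    -- Hypothetical schema: products(product_id, ...), meta(meta_id, product_id, ...).
    -- These column names are assumptions for illustration only.
    SELECT *
    FROM   products p
    JOIN  (
       SELECT DISTINCT ON (product_id) *
       FROM   meta
       ORDER  BY product_id, meta_id DESC   -- newest meta row per product
       ) m USING (product_id);              -- join column appears once in the output
    ```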

    If performance is your paramount requirement, consider more options:

    • a MATERIALIZED VIEW with pre-aggregated data from meta, if your data does not change (much).
    • a recursive CTE emulating a loose index scan for a big meta table with many rows per product (relatively few distinct product_id).
      This is the only way I know to use an index for a DISTINCT query over the whole table.
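
    Both options can be sketched roughly as follows. This is a sketch under assumptions, not a tuned solution: the names meta_latest and meta_product_id_id_idx are made up, meta.product_id is assumed to be NOT NULL, and PostgreSQL 9.3 or later is assumed (for materialized views and LATERAL).

    ```sql
    -- Option 1: materialized view holding the latest meta row per product.
    CREATE MATERIALIZED VIEW meta_latest AS       -- hypothetical name
    SELECT DISTINCT ON (product_id) *
    FROM   meta
    ORDER  BY product_id, id DESC;

    REFRESH MATERIALIZED VIEW meta_latest;        -- rerun after meta changes

    -- Option 2: recursive CTE emulating a loose index scan.
    -- It only pays off with a matching index (hypothetical name):
    CREATE INDEX meta_product_id_id_idx ON meta (product_id, id DESC);

    WITH RECURSIVE cte AS (
       (  -- parentheses are required around a branch with ORDER BY / LIMIT
       SELECT product_id, id
       FROM   meta
       ORDER  BY product_id, id DESC
       LIMIT  1                                   -- latest row of the first product
       )
       UNION ALL
       SELECT l.product_id, l.id
       FROM   cte c
       CROSS  JOIN LATERAL (
          SELECT m.product_id, m.id
          FROM   meta m
          WHERE  m.product_id > c.product_id      -- jump to the next product
          ORDER  BY m.product_id, m.id DESC
          LIMIT  1
          ) l                                     -- no row left: recursion stops
       )
    SELECT p.*, m.id AS meta_id, m.price, m.flag
    FROM   cte
    JOIN   meta     m ON m.id = cte.id
    JOIN   products p ON p.id = cte.product_id;
    ```

    Each recursion step does one small index lookup per distinct product_id, so this tends to help only when there are many meta rows per product.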
  • 2020-12-19 14:55

    You can use a subquery to identify the max(ID) for each product, then use that in the superquery to gather the details you want to display:

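    SELECT q.product_id, meta.*
    FROM  (
       -- latest meta id per product; explicit alias instead of the implicit name "max"
       SELECT product_id, max(meta.id) AS max_id
       FROM   meta
       JOIN   products ON products.id = meta.product_id
       GROUP  BY product_id
       ) q
    JOIN   meta ON q.max_id = meta.id;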
    

    It is not the only solution!

    A quick comparison with the DISTINCT ON solutions suggests that it is slower (http://sqlfiddle.com/#!15/c8f34/38). It avoids a full sort on ID and prefers a sequential scan.
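
    One way to repeat such a comparison on your own data is to look at the actual plans and timings with EXPLAIN. A minimal sketch, using only the tables and columns from the question (the alias max_id is my own); results will depend heavily on table sizes and indexes:

    ```sql
    -- Plan and timing for the max(id) subquery variant.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT q.product_id, meta.*
    FROM  (
       SELECT product_id, max(meta.id) AS max_id
       FROM   meta
       JOIN   products ON products.id = meta.product_id
       GROUP  BY product_id
       ) q
    JOIN   meta ON q.max_id = meta.id;

    -- Plan and timing for a DISTINCT ON variant, for comparison.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT DISTINCT ON (m.product_id) *
    FROM   meta m
    JOIN   products p ON p.id = m.product_id
    ORDER  BY m.product_id, m.id DESC;
    ```

    Note that EXPLAIN ANALYZE actually executes the statements, so run each one a few times to reduce caching effects before comparing.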

  • 2020-12-19 15:01

    Use DISTINCT ON as suggested in @Craig's answer, but combined with an ORDER BY clause as explained in the comments. SQL Fiddle

    select distinct on(m.product_id) * 
    from
        meta m
        inner join
        products p on p.id = m.product_id
    order by m.product_id, m.id desc;
    
  • 2020-12-19 15:02

    I think you might be looking for DISTINCT ON, a PostgreSQL extension feature:

    SELECT 
      DISTINCT ON(product_id)
      * 
    FROM meta 
    INNER JOIN products ON products.id = meta.product_id;
    

    http://sqlfiddle.com/#!15/c8f34/18

    However, note that without an ORDER BY the results are not guaranteed to be consistent; the database can pick any row it wants from the matching rows.
