Using DISTINCT inside JOIN is creating trouble [duplicate]

Possible Duplicate:
How can I modify this query with two Inner Joins so that it stops giving duplicate results?

I'm having trouble getting my query to work.

SELECT itpitems.identifier, itpitems.name, itpitems.subtitle, itpitems.description, itpitems.itemimg, itpitems.mainprice, itpitems.upc, itpitems.isbn, itpitems.weight, itpitems.pages, itpitems.publisher, itpitems.medium_abbr, itpitems.medium_desc, itpitems.series_abbr, itpitems.series_desc, itpitems.voicing_desc, itpitems.pianolevel_desc, itpitems.bandgrade_desc, itpitems.category_code, itprank.overall_ranking, itpitnam.name AS artist, itpitnam.type_code FROM itpitems 
        INNER JOIN  itprank ON (itprank.item_number = itpitems.identifier) 
        INNER JOIN  (SELECT DISTINCT type_code FROM itpitnam) itpitnam ON (itprank.item_number = itpitnam.item_number)   
        WHERE mainprice > 1    
        LIMIT 3

I keep getting Unknown column 'itpitnam.name' in 'field list'.

However, if I change DISTINCT type_code to *, I do not get that error, but I do not get the results I want either.

This is a big result table so I am making a dummy example...

With *, I get something like:

+-----------+---------+----------+
| identifier| name    | type_code|
+-----------+---------+----------+
| 2         | Joe     | A        |
| 2         | Amy     | R        |
| 7         | Mike    | B        |
+-----------+---------+----------+

The problem here is that I have two instances of identifier = 2 because the type_code is different. I have tried GROUP BY at the outside end of the query, but it is sifting through so many records it creates too much strain on the server, so I'm trying to find an alternative way of getting the results I need.

What I want to achieve (using the same dummy output) would look something like this:

+-----------+---------+----------+
| identifier| name    | type_code|
+-----------+---------+----------+
| 2         | Joe     | A        |
| 7         | Mike    | B        |
| 8         | Sam     | R        |
+-----------+---------+----------+

It should skip over the duplicate identifier regardless of whether type_code is different.

Can someone help me modify this query to get the results as simulated in the above chart?

One approach is to use an inline view, like the query you already have. But instead of using DISTINCT, you would use a GROUP BY to eliminate duplicates. The simplest inline view to satisfy your requirements would be:

( SELECT n.item_number, n.name, n.type_code
    FROM itpitnam n
   GROUP BY n.item_number
) itpitnam

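Note, however, that it's not deterministic which row from itpitnam supplies the values for name and type_code. (If the ONLY_FULL_GROUP_BY SQL mode is enabled, MySQL will reject this query, because name and type_code are neither aggregated nor listed in the GROUP BY.) A more elaborate inline view can make this more specific.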
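As a rough sketch of such a "more elaborate" inline view: the version below keeps, for each item_number, only the row with the lowest (type_code, name), which mirrors the ORDER BY used in the correlated subqueries further down. It assumes that type_code and name are never NULL and that (item_number, type_code, name) is unique in itpitnam; neither is confirmed by the question. The select list is shortened for readability, and the aliases i, r, n and o are only illustrative.

```sql
-- Sketch only: assumes type_code and name are NOT NULL and that
-- (item_number, type_code, name) is unique in itpitnam.
SELECT i.identifier
     , i.name
     , r.overall_ranking
     , n.name AS artist
     , n.type_code
  FROM itpitems i
  JOIN itprank r
    ON r.item_number = i.identifier
  JOIN ( -- keep a row only if no other row for the same item_number
         -- sorts before it on (type_code, name)
         SELECT n.item_number, n.name, n.type_code
           FROM itpitnam n
          WHERE NOT EXISTS
                ( SELECT 1
                    FROM itpitnam o
                   WHERE o.item_number = n.item_number
                     AND ( o.type_code < n.type_code
                        OR ( o.type_code = n.type_code
                         AND o.name < n.name ) )
                )
       ) n
    ON n.item_number = r.item_number
 WHERE i.mainprice > 1
 LIMIT 3
```

Because the inline view is evaluated against the whole of itpitnam, it may still be expensive on a large table, so it is worth comparing against the other approaches on real data.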

Another common approach to this type of problem is to use a correlated subquery in the SELECT list. For returning a small set of rows, this can perform reasonably well. But for returning large sets, there are more efficient approaches.

SELECT i.identifier
     , i.name
     , i.subtitle
     , i.description
     , i.itemimg 
     , i.mainprice
     , i.upc
     , i.isbn
     , i.weight
     , i.pages
     , i.publisher
     , i.medium_abbr
     , i.medium_desc
     , i.series_abbr
     , i.series_desc
     , i.voicing_desc
     , i.pianolevel_desc
     , i.bandgrade_desc
     , i.category_code
     , r.overall_ranking
     , ( SELECT n1.name
           FROM itpitnam n1
          WHERE n1.item_number = r.item_number
          ORDER BY n1.type_code, n1.name
          LIMIT 1
       ) AS artist
     , ( SELECT n2.type_code
           FROM itpitnam n2
          WHERE n2.item_number = r.item_number
          ORDER BY n2.type_code, n2.name
          LIMIT 1
       ) AS type_code
  FROM itpitems i
  JOIN itprank r
    ON r.item_number = i.identifier
 WHERE mainprice > 1
 LIMIT 3

That query will return the specified result set, with one significant difference. The original query uses an INNER JOIN to the itpitnam table, which means a row is returned only if there is a matching row in itpitnam. The query above, however, emulates an OUTER JOIN: it returns a row even when no matching row is found in itpitnam (artist and type_code are then NULL).

UPDATE

For best performance of those correlated subqueries, you'll want an appropriate index available:

... ON itpitnam (item_number, type_code, name)

That index is most appropriate because it's a "covering index": the query can be satisfied entirely from the index, without referencing data pages in the underlying table. There's also an equality predicate on the leading column and an ORDER BY on the next two columns, so it will avoid a "sort" operation.
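A minimal sketch of creating such an index and checking that it is used. The index name itpitnam_ix1 and the literal item number 42 are made up for illustration (use a literal that matches the real datatype of item_number); only the column order comes from the text above.

```sql
-- Hypothetical index name; column order as described above.
CREATE INDEX itpitnam_ix1
    ON itpitnam (item_number, type_code, name);

-- Run one of the subqueries on its own, with a made-up item number,
-- to inspect the plan MySQL chooses for it.
EXPLAIN
SELECT n1.name
  FROM itpitnam n1
 WHERE n1.item_number = 42
 ORDER BY n1.type_code, n1.name
 LIMIT 1;
```

If the index is used as intended, you would expect the Extra column of the EXPLAIN output to show "Using index" (covering index) and not "Using filesort".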

If you have a guarantee that either the type_code or name column in the itpitnam table is NOT NULL, you can add a predicate to eliminate the rows that are "missing" a matching row, e.g.

HAVING artist IS NOT NULL

(Adding that will likely have an impact on performance.) Absent that kind of guarantee, you'd need to add an INNER JOIN or a predicate that tests for the existence of a matching row, to get INNER JOIN behavior.
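As a sketch of the second option, an EXISTS predicate can be added to the WHERE clause of the correlated-subquery version. It does not depend on any column in itpitnam being NOT NULL. The select list is shortened here; the remaining columns would be listed as in the full query above.

```sql
-- Sketch only: shortened select list, same tables and aliases as above.
SELECT i.identifier
     , i.name
     , r.overall_ranking
     , ( SELECT n1.name
           FROM itpitnam n1
          WHERE n1.item_number = r.item_number
          ORDER BY n1.type_code, n1.name
          LIMIT 1
       ) AS artist
     , ( SELECT n2.type_code
           FROM itpitnam n2
          WHERE n2.item_number = r.item_number
          ORDER BY n2.type_code, n2.name
          LIMIT 1
       ) AS type_code
  FROM itpitems i
  JOIN itprank r
    ON r.item_number = i.identifier
 WHERE i.mainprice > 1
   -- return the item only if at least one itpitnam row matches,
   -- which gives the INNER JOIN behavior of the original query
   AND EXISTS ( SELECT 1
                  FROM itpitnam n
                 WHERE n.item_number = r.item_number
              )
 LIMIT 3
```

The EXISTS check can use the same index on item_number as the two subqueries, so it should add relatively little work per row, though that is worth confirming with EXPLAIN.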

SELECT  a.*,
        b.overall_ranking, 
        c.name AS artist, 
        c.type_code 
FROM    itpitems a
        INNER JOIN  itprank b 
            ON b.item_number = a.identifier
        INNER JOIN  itpitnam c
            ON b.item_number = c.item_number
        INNER JOIN
        (
            SELECT  item_number, MAX(type_code) code
            FROM    itpitnam
            GROUP   BY item_number
        ) d ON  c.item_number = d.item_number AND
                c.type_code = d.code

WHERE   mainprice > 1    
LIMIT   3

Follow-up question: can you please post the table schema and explain how the tables are related to each other? Then I will know which columns to link on.

Source: https://stackoverflow.com/questions/14658674/using-distinct-inside-join-is-creating-trouble

Tags

mysql

sql

greatest-n-per-group