Select distinct join on multiple columns in Presto

Question

I have two tables in Presto.

table1 looks like this:

+-----+-----+----------+--------+
| id1 | id2 | date     | degree |
+-----+-----+----------+--------+
|   1 |  10 | 20200101 |      1 |
|   1 |  11 | 20200101 |      1 |
|   1 |  11 | 20200101 |      1 |
|   2 |  52 | 20200101 |      2 |
|   2 |  52 | 20200101 |      2 |
|   2 |  53 | 20200101 |      2 |
|   3 |  21 | 20200101 |      2 |
| ... | ... | ...      | ...    |
+-----+-----+----------+--------+

and table2 is:

+-----+-----+----------+-------+------+
| id1 | id2 | date     | price | rank |
+-----+-----+----------+-------+------+
|   1 |  10 | 20200101 |  1200 |    1 |
|   1 |  10 | 20200101 |  1200 |    2 |
|   1 |  10 | 20200101 |       |      |
|   1 |  10 | 20200101 |  1300 |    1 |
|   1 |  10 | 20200101 |  1300 |    2 |
| ... | ... | ...      | ...   | ...  |
+-----+-----+----------+-------+------+

What I want is to take the price column from table2 and add it to table1, matching on the three columns id1, id2 and date. If I do a simple join like this:

select tab1.id1, tab1.id2, tab1.date, tab2.price
from tab1
left join tab2
on tab1.id1 = tab2.id1
and tab1.id2 = tab2.id2
and tab1.date = tab2.date

this is what I get:

+-----+-----+----------+-------+--------+
| id1 | id2 | date     | price | degree |
+-----+-----+----------+-------+--------+
|   1 |  10 | 20200101 |  1200 |      1 |
|   1 |  10 | 20200101 |  1200 |      1 |
|   1 |  10 | 20200101 |       |      1 |
|   1 |  10 | 20200101 |  1300 |      1 |
|   1 |  10 | 20200101 |  1300 |      1 |
+-----+-----+----------+-------+--------+

But what I actually want is this:

+-----+-----+----------+-------+--------+
| id1 | id2 | date     | price | degree |
+-----+-----+----------+-------+--------+
|   1 |  10 | 20200101 |  1200 |      1 |
|   1 |  10 | 20200101 |  1300 |      1 |
+-----+-----+----------+-------+--------+

Answer 1:

Use group by:

select t.id1, t.id2, t.date, t.price
from (
  select tab1.id1 as id1, tab1.id2 as id2, tab1.date as date, tab2.price as price
  from tab1
  left join tab2
    on tab1.id1 = tab2.id1
   and tab1.id2 = tab2.id2
   and tab1.date = tab2.date
) as t
group by t.id1, t.id2, t.date, t.price

Answer 2:

This involves some speculation about your data, but based on your example it looks like if you limit the rank column to the value 1, it will give the desired results.

select
  tab1.id1, tab1.id2, tab1.date, tab2.price
from
  tab1
  join tab2 on
    tab1.id1 = tab2.id1 and
    tab1.id2 = tab2.id2 and
    tab1.date = tab2.date and
    tab2.rank = 1 -- add this line

Of course, if that's not true across the dataset, then this won't work.
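
One way to test that assumption before relying on it is to look for rows that would still repeat after the filter. This is only a sketch using the table and column names from the question; the alias n_rows is made up for illustration.

```sql
-- Lists (id1, id2, date, price) combinations that appear more than once
-- in tab2 even after restricting to rank = 1.
-- An empty result suggests the rank = 1 filter removes the repeats in tab2.
select id1, id2, date, price, count(*) as n_rows
from tab2
where rank = 1
group by id1, id2, date, price
having count(*) > 1
```

Note that this only inspects tab2. Rows that are repeated in tab1 itself, such as the two (1, 11, 20200101) rows in the example, would still be multiplied by the join.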

In most cases, I like to avoid select distinct and its derivations (including group by every column, which is essentially a select distinct) because it has a very arbitrary feel to it -- just remove any records that happen to be the same. Instead, I think it's better to understand your data and know why certain records are being screened out.
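
To make the comparison concrete, here is a sketch of the select distinct form, again using the table names from the question. It should return the same rows as grouping by all four output columns.

```sql
-- Collapses output rows that are identical in every selected column,
-- which is what grouping by id1, id2, date and price also does.
select distinct
  tab1.id1, tab1.id2, tab1.date, tab2.price
from
  tab1
  left join tab2 on
    tab1.id1 = tab2.id1 and
    tab1.id2 = tab2.id2 and
    tab1.date = tab2.date
```

With the sample data, the tab2 row that has an empty price is not a duplicate of anything, so it survives as its own row in either form.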

If, for example, you really do want to pick the record with the lowest "rank" value, but it's not always guaranteed to be the value of 1, this would work:

select distinct on (tab1.id1, tab1.id2, tab1.date)
  tab1.id1, tab1.id2, tab1.date, tab2.price
from
  tab1
  join tab2 on
    tab1.id1 = tab2.id1 and
    tab1.id2 = tab2.id2 and
    tab1.date = tab2.date
order by
  tab1.id1, tab1.id2, tab1.date, tab2.rank

I know I just said I avoid select distinct, but this is a select distinct on (PostgreSQL syntax), which is quite different: it keeps only the first row for each combination of the listed columns, and the order by makes it unambiguous which record is retained and why.
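
As far as I know, Presto does not accept select distinct on, so there the same idea would need a window function. The following is a minimal sketch under that assumption: it numbers the rows within each (id1, id2, date) group by ascending rank and keeps the first one. The aliases t and rn are made up for illustration.

```sql
-- Keep one row per (id1, id2, date): the one with the lowest rank.
select id1, id2, date, price
from (
  select
    tab1.id1, tab1.id2, tab1.date, tab2.price,
    -- rn = 1 marks the lowest-rank row within each key
    row_number() over (
      partition by tab1.id1, tab1.id2, tab1.date
      order by tab2.rank
    ) as rn
  from
    tab1
    join tab2 on
      tab1.id1 = tab2.id1 and
      tab1.id2 = tab2.id2 and
      tab1.date = tab2.date
) as t
where rn = 1
```

If several rows in a group share the lowest rank, row_number() picks one of them arbitrarily; adding a further column to the window's order by (price, for example) would make the choice deterministic.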

Source: https://stackoverflow.com/questions/60181248/select-distinct-join-on-multiple-column-on-presto

Tags

sql

postgresql

presto