Question
I have two tables in Presto.
table1 looks like this:
+-----+-----+----------+--------+
| id1 | id2 | date     | degree |
+-----+-----+----------+--------+
| 1   | 10  | 20200101 | 1      |
| 1   | 11  | 20200101 | 1      |
| 1   | 11  | 20200101 | 1      |
| 2   | 52  | 20200101 | 2      |
| 2   | 52  | 20200101 | 2      |
| 2   | 53  | 20200101 | 2      |
| 3   | 21  | 20200101 | 2      |
| ... | ... | ...      | ...    |
+-----+-----+----------+--------+
and table2 is:
+-----+-----+----------+-------+------+
| id1 | id2 | date     | price | rank |
+-----+-----+----------+-------+------+
| 1   | 10  | 20200101 | 1200  | 1    |
| 1   | 10  | 20200101 | 1200  | 2    |
| 1   | 10  | 20200101 |       |      |
| 1   | 10  | 20200101 | 1300  | 1    |
| 1   | 10  | 20200101 | 1300  | 2    |
| ... | ... | ...      | ...   | ...  |
+-----+-----+----------+-------+------+
What I want to do is get the price column from table2 and add it to table1, matching on the three columns id1, id2 and date. If I do a simple join like this:
select tab1.id1, tab1.id2, tab1.date, tab2.price
from tab1
left join tab2
on tab1.id1 = tab2.id1
and tab1.id2 = tab2.id2
and tab1.date = tab2.date
this is what we have:
+-----+-----+----------+-------+--------+
| id1 | id2 | date     | price | degree |
+-----+-----+----------+-------+--------+
| 1   | 10  | 20200101 | 1200  | 1      |
| 1   | 10  | 20200101 | 1200  | 1      |
| 1   | 10  | 20200101 |       | 1      |
| 1   | 10  | 20200101 | 1300  | 1      |
| 1   | 10  | 20200101 | 1300  | 1      |
+-----+-----+----------+-------+--------+
but in fact what I want is this one:
+-----+-----+----------+-------+--------+
| id1 | id2 | date     | price | degree |
+-----+-----+----------+-------+--------+
| 1   | 10  | 20200101 | 1200  | 1      |
| 1   | 10  | 20200101 | 1300  | 1      |
+-----+-----+----------+-------+--------+
Answer 1:
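Use group by on every selected column to collapse the duplicate rows:
select t.id1, t.id2, t.date, t.price
from (
    -- same join as in the question, with explicit column aliases
    select tab1.id1 as id1, tab1.id2 as id2, tab1.date as date, tab2.price as price
    from tab1
    left join tab2
    on tab1.id1 = tab2.id1
    and tab1.id2 = tab2.id2
    and tab1.date = tab2.date
) as t
-- grouping by all output columns keeps one row per distinct combination
group by t.id1, t.id2, t.date, t.price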
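To try the group by approach without the real tables, here is a minimal sketch that inlines a few of the question's sample rows as VALUES. The inline rows and the commented-out filter are assumptions added for illustration, not part of the original answer. With these rows the left join still returns a null-price row for the empty table2 record and for the unmatched (1, 11) rows, so a filter on price is probably needed to reach the exact result the question asks for.

```sql
-- Sample data inlined as CTEs (assumption: a subset of the question's rows).
with tab1 (id1, id2, date, degree) as (
    values
        (1, 10, 20200101, 1),
        (1, 11, 20200101, 1),
        (1, 11, 20200101, 1)
),
tab2 (id1, id2, date, price, rank) as (
    values
        (1, 10, 20200101, 1200, 1),
        (1, 10, 20200101, 1200, 2),
        (1, 10, 20200101, cast(null as integer), cast(null as integer)),
        (1, 10, 20200101, 1300, 1),
        (1, 10, 20200101, 1300, 2)
)
select tab1.id1, tab1.id2, tab1.date, tab2.price
from tab1
left join tab2
    on tab1.id1 = tab2.id1
    and tab1.id2 = tab2.id2
    and tab1.date = tab2.date
-- where tab2.price is not null  -- uncomment to drop rows without a price
group by tab1.id1, tab1.id2, tab1.date, tab2.price
```

Grouping treats all null prices for the same key as one group, so the duplicates collapse but the null rows themselves remain unless they are filtered out.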
Answer 2:
This involves some speculation about your data, but based on your example it looks like if you limit the rank column to the value 1, it will give the desired results.
select
tab1.id1, tab1.id2, tab1.date, tab2.price
from
tab1
join tab2 on
tab1.id1 = tab2.id1 and
tab1.id2 = tab2.id2 and
tab1.date = tab2.date and
tab2.rank = 1 -- add this line
Of course, if that's not true across the dataset, then this won't work.
In most cases, I like to avoid select distinct and its derivations (including group by every column, which is essentially a select distinct) because it has a very arbitrary feel to it: it just removes any records that happen to be the same. Instead, I think it's better to understand your data and know why certain records are being screened out.
If, for example, you really do want to pick the record with the lowest "rank" value, but it's not always guaranteed to be the value of 1, this would work:
select distinct on (tab1.id1, tab1.id2, tab1.date)
tab1.id1, tab1.id2, tab1.date, tab2.price
from
tab1
join tab2 on
tab1.id1 = tab2.id1 and
tab1.id2 = tab2.id2 and
tab1.date = tab2.date
-- no rank = 1 filter here: the order by picks the lowest rank per key
order by
tab1.id1, tab1.id2, tab1.date, tab2.rank
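I know I just said I avoid select distinct, but this is actually a select distinct on, which is quite different, and the order by makes it very unambiguous as to which record is retained and why. Note that distinct on is PostgreSQL syntax and Presto does not support it, so on Presto the same selection has to be expressed another way, for example with a row_number() window function.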
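Since the question is about Presto, here is a rough sketch of how the same "lowest rank per key" selection might be written with row_number(), which Presto supports. The inline VALUES rows and the names ranked and rn are assumptions made for illustration. Partitioning by the three join keys mirrors the distinct on list above and keeps a single row per key; when two prices share the lowest rank, as 1200 and 1300 do here, which one survives is not deterministic.

```sql
-- Sample data inlined as CTEs (assumption: a subset of the question's rows).
with tab1 (id1, id2, date, degree) as (
    values (1, 10, 20200101, 1)
),
tab2 (id1, id2, date, price, rank) as (
    values
        (1, 10, 20200101, 1200, 1),
        (1, 10, 20200101, 1200, 2),
        (1, 10, 20200101, 1300, 1),
        (1, 10, 20200101, 1300, 2)
),
ranked as (
    select
        tab1.id1, tab1.id2, tab1.date, tab2.price,
        -- number the joined rows within each key, lowest rank first
        row_number() over (
            -- add tab2.price to this list to keep one row per distinct price
            partition by tab1.id1, tab1.id2, tab1.date
            order by tab2.rank
        ) as rn
    from tab1
    join tab2
        on tab1.id1 = tab2.id1
        and tab1.id2 = tab2.id2
        and tab1.date = tab2.date
)
select id1, id2, date, price
from ranked
where rn = 1  -- keep only the first row of each partition
```

If the goal is the two-row result shown in the question (one row for 1200 and one for 1300), including tab2.price in the partition by list is likely the closer fit, since the desired output keeps one row per price rather than one row per key.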
Source: https://stackoverflow.com/questions/60181248/select-distinct-join-on-multiple-column-on-presto