Filter a one-to-many query by requiring all of many meet criteria

青春壹個敷衍的年華 提交于 2019-12-06 06:16:07

问题


Imagine the following tables:

create table boxes( id int, name text, ...);

create table thingsinboxes( id int, box_id int, thing enum('apple,'banana','orange');

And the tables look like:

Boxes:
id | name
1  | orangesOnly
2  | orangesOnly2
3  | orangesBananas
4  | misc

thingsinboxes:
id | box_id | thing
1  |  1     | orange
2  |  1     | orange
3  |  2     | orange
4  |  3     | orange
5  |  3     | banana
6  |  4     | orange
7  |  4     | apple
8  |  4     | banana

How do I select the boxes that contain at least one orange and nothing that isn't an orange?

How does this scale, assuming I have several hundred thousand boxes and possibly a million things in boxes?

I'd like to keep this all in SQL if possible, rather than post-processing the result set with a script.

I'm using both postgres and mysql, so subqueries are probably bad, given that mysql doesn't optimize subqueries (pre version 6, anyway).


回答1:


SELECT b.*
FROM boxes b JOIN thingsinboxes t ON (b.id = t.box_id)
GROUP BY b.id
HAVING COUNT(DISTINCT t.thing) = 1 AND SUM(t.thing = 'orange') > 0;

Here's another solution that does not use GROUP BY:

SELECT DISTINCT b.*
FROM boxes b
  JOIN thingsinboxes t1 
    ON (b.id = t1.box_id AND t1.thing = 'orange')
  LEFT OUTER JOIN thingsinboxes t2 
    ON (b.id = t2.box_id AND t2.thing != 'orange')
WHERE t2.box_id IS NULL;

As always, before you make conclusions about the scalability or performance of a query, you have to try it with a realistic data set, and measure the performance.




回答2:


I think Bill Karwin's query is just fine, however if a relatively small proportion of boxes contain oranges, you should be able to speed things up by using an index on the thing field:

SELECT b.*
FROM boxes b JOIN thingsinboxes t1 ON (b.id = t1.box_id)
WHERE t1.thing = 'orange'
AND NOT EXISTS (
    SELECT 1
    FROM thingsinboxes t2
    WHERE t2.box_id = b.id
    AND t2.thing <> 'orange'
)
GROUP BY t1.box_id

The WHERE NOT EXISTS subquery will only be run once per orange thing, so it's not too expensive provided there aren't many oranges.



来源:https://stackoverflow.com/questions/481588/filter-a-one-to-many-query-by-requiring-all-of-many-meet-criteria

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!