Question
MySQL/MariaDB schema and sample data:
CREATE DATABASE IF NOT EXISTS `puzzle` DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_unicode_ci;
USE `puzzle`;
DROP TABLE IF EXISTS `event`;
CREATE TABLE `event` (
`eventId` bigint(20) NOT NULL AUTO_INCREMENT,
`sourceId` bigint(20) NOT NULL COMMENT 'think of source as camera',
`carNumber` varchar(40) NOT NULL COMMENT 'ex: 5849',
`createdOn` datetime DEFAULT NULL,
PRIMARY KEY (`eventId`)
) ENGINE=INNODB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
INSERT INTO `event` (`eventId`, `sourceId`, `carNumber`, `createdOn`) VALUES
(1, 44, '4456', '2016-09-20 20:24:05'),
(2, 26, '26484', '2016-09-20 20:24:05'),
(3, 5, '4456', '2016-09-20 20:24:06'),
(4, 3, '72704', '2016-09-20 20:24:15'),
(5, 3, '399606', '2016-09-20 20:26:15'),
(6, 5, '4456', '2016-09-20 20:27:25'),
(7, 44, '72704', '2016-09-20 20:29:25'),
(8, 3, '4456', '2016-09-20 20:30:55'),
(9, 44, '26484', '2016-09-20 20:34:55'),
(10, 26, '4456', '2016-09-20 20:35:15'),
(11, 3, '72704', '2016-09-20 20:35:15'),
(12, 3, '399606', '2016-09-20 20:44:35'),
(13, 26, '4456', '2016-09-20 20:49:45');
I want to get the carNumber(s) that were recorded by sourceId = 3 AND by (26 OR 44) between 20:24 and 20:45. The query needs to be fast, since the real table contains over 300 million records.
So far, the query below is as far as I could get (it does not even produce valid results):
select * from event e where
e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
and e.sourceId IN(3,26,44) group by e.carNumber;
The correct results for the provided data:
carNumber
4456
72704
I am really puzzled and stuck. I tried EXISTS, joins, and subqueries without luck, so I wonder whether SQL can solve this or whether I should do it in backend code.
MySQL / MariaDB version in use:
mariadb-5.5.50
mysql-5.5.51
Answer 1:
If you need this to be fast, then the following might work, assuming you have an index on event(createdOn, carNumber, sourceId):
select e.carNumber
from event e
where e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
group by e.carNumber
having sum(e.sourceId = 3) > 0 and
sum(e.sourceId IN (26, 44)) > 0;
I would be inclined to change this to:
select e.carNumber
from event e
where e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00' and
e.sourceId in (3, 26, 44)
group by e.carNumber
having sum(e.sourceId = 3) > 0 and
sum(e.sourceId IN (26, 44)) > 0;
And then for performance, even this:
select carNumber
from ((select carNumber, sourceId
       from event e
       where e.sourceId = 3 and
             e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
      ) union all
      (select carNumber, sourceId
       from event e
       where e.sourceId = 26 and
             e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
      ) union all
      (select carNumber, sourceId
       from event e
       where e.sourceId = 44 and
             e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
      )
     ) e
group by e.carNumber
having sum(e.sourceId = 3) > 0 and
       sum(e.sourceId IN (26, 44)) > 0;
This version can take advantage of an index on event(sourceId, createdOn, carNumber). Each subquery should use this index very effectively, bringing a smallish amount of data together for the final aggregation.
Answer 2:
You can use the HAVING clause to filter on groups. Use sum() to count how many times certain conditions are met within a group of rows:
select e.carNumber
from event e
where e.createdOn > '2016-09-20 20:24:00'
and e.createdOn < '2016-09-20 20:45:00'
group by e.carNumber
having sum(e.sourceId = 3) > 0
and sum(e.sourceId IN (26,44)) > 0
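To see why this works: in MySQL and MariaDB a comparison such as e.sourceId = 3 evaluates to 1 or 0, so sum() over it counts the matching rows in each group. As a rough illustration (the aliases hits3 and hits26or44 are made up here and are not part of the answer), dropping the HAVING clause and selecting the two sums against the sample data from the question shows what the filter is testing:

```sql
-- Illustration only: show the per-car counts that the HAVING clause tests.
-- hits3 and hits26or44 are hypothetical aliases.
SELECT e.carNumber,
       SUM(e.sourceId = 3)         AS hits3,
       SUM(e.sourceId IN (26, 44)) AS hits26or44
FROM event e
WHERE e.createdOn > '2016-09-20 20:24:00'
  AND e.createdOn < '2016-09-20 20:45:00'
GROUP BY e.carNumber
ORDER BY e.carNumber;

-- With the 13 sample rows from the question (carNumber sorts as a string):
-- carNumber | hits3 | hits26or44
-- 26484     |     0 |          2
-- 399606    |     2 |          0
-- 4456      |     1 |          2
-- 72704     |     2 |          1
```

Only 4456 and 72704 have both counts above zero, which matches the expected result in the question.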
Answer 3:
Something like the following should do the trick for you:
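SELECT carNumber
FROM event
WHERE sourceId = 3
  AND createdOn > '2016-09-20 20:24:00' AND createdOn < '2016-09-20 20:45:00'
  AND carNumber IN (SELECT carNumber FROM event
                    WHERE sourceId IN (26,44)
                      AND createdOn > '2016-09-20 20:24:00' AND createdOn < '2016-09-20 20:45:00')
GROUP BY carNumber;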
That WHERE clause looks for records with a sourceId of 3 and then also makes sure that the carNumber has at least one other record in the table where the sourceId is either 26 or 44.
Don't code anything outside of SQL for this one since this is definitely a problem that SQL is built to solve as quickly as possible.
回答4:
Shrink the table size
With 300M rows, you should really use the smallest datatypes that are practical.
- BIGINT takes 8 bytes; INT UNSIGNED (only 4 bytes) is usually sufficient (max of 4 billion).
- If fewer than 65K cameras, use a 2-byte SMALLINT UNSIGNED.
- carNumber looks like a number, so why use VARCHAR? The examples you have are taking 5-7 bytes in VARCHAR, would fit in 4 bytes with INT UNSIGNED or 3 bytes with MEDIUMINT UNSIGNED (max of 16M).
Shrinking the table will help any solution chosen.
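A rough sketch of what that change could look like follows. It assumes every carNumber is purely numeric with no meaningful leading zeros, and that there are fewer than 65K distinct cameras; neither is confirmed by the question, so check the data first.

```sql
-- Sketch only, under the assumptions stated above.
-- Step 1: count car numbers that are not plain digits. Anything above 0
-- means carNumber cannot safely become an integer column.
SELECT COUNT(*) AS non_numeric
FROM event
WHERE carNumber NOT REGEXP '^[0-9]+$';

-- Step 2: if the count is 0, narrow the columns in a single ALTER.
-- MODIFY replaces the whole column definition, so the comments are restated.
ALTER TABLE event
  MODIFY eventId   INT UNSIGNED      NOT NULL AUTO_INCREMENT,
  MODIFY sourceId  SMALLINT UNSIGNED NOT NULL COMMENT 'think of source as camera',
  MODIFY carNumber INT UNSIGNED      NOT NULL COMMENT 'ex: 5849';
```

On a table with 300M rows the ALTER will probably take a long time and need spare disk space while the table is rebuilt, so it is worth trying on a copy first.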
Covering index
This has already been suggested in other answers, but I want to make it clear why it helps. If all the columns a query needs exist in a single index, the query can be performed in the index's BTree, without touching the data. This is usually faster because the index is smaller. A 'covering' index for this query has sourceId, carNumber, createdOn in any order.
Order of columns in index
Since an index can only be used left-to-right the order is important. (This does not apply to Gordon's first select, which needs createdOn first.)
- sourceId is handled with = or IN, so it should come first. In the case of IN, you probably need 5.6 or later to get the IN optimizations.
- createdOn is a range, so the lookup will stop with it.
- For "covering", now any extra columns can be added on. In this case, carNumber.
So, most (not all) suggestions want this order: INDEX(sourceId, createdOn, carNumber).
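For reference, adding that index and checking that it is used as a covering index might look like this. The index name is hypothetical, and the EXPLAIN is run against just one sourceId branch to keep the plan easy to read.

```sql
-- idx_source_created_car is a made-up name; any name will do.
ALTER TABLE event
  ADD INDEX idx_source_created_car (sourceId, createdOn, carNumber);

-- Check the plan for one branch of the lookup. "Using index" in the Extra
-- column means the query was answered from the index alone (covering).
EXPLAIN
SELECT carNumber
FROM event
WHERE sourceId = 3
  AND createdOn > '2016-09-20 20:24:00'
  AND createdOn < '2016-09-20 20:45:00';
```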
Get rid of auto_increment
Do you use eventId in other tables? If so, then you should probably keep it. If not, then is the combo (sourceId, createdOn, carNumber) unique? If so, then make that the PRIMARY KEY. A surrogate PK is nice for some situations, but it hinders performance in others. I am suggesting that it may be a hindrance here.
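If eventId really is unused elsewhere, a rough way to test and apply this could look like the sketch below. It assumes createdOn is never NULL in practice (a PRIMARY KEY column cannot be NULL), which the question does not state. The duplicate check scans the whole table, so expect it to be slow.

```sql
-- Step 1: look for duplicates of the candidate key.
-- No rows returned means (sourceId, createdOn, carNumber) is unique.
SELECT sourceId, createdOn, carNumber, COUNT(*) AS copies
FROM event
GROUP BY sourceId, createdOn, carNumber
HAVING COUNT(*) > 1
LIMIT 10;

-- Step 2: only if step 1 returns nothing, createdOn has no NULLs,
-- and no other table references eventId.
-- Drop the surrogate key and promote the natural key.
ALTER TABLE event
  DROP PRIMARY KEY,
  DROP COLUMN eventId,
  MODIFY createdOn DATETIME NOT NULL,
  ADD PRIMARY KEY (sourceId, createdOn, carNumber);
```

With InnoDB the rows are stored in PRIMARY KEY order, so after this change the (sourceId, createdOn, carNumber) lookups read the table itself and the separate secondary index on those columns is no longer needed.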
Avoid slow operations
UNION usually involves a temp table; this adds overhead. While UNION is beneficial in making better use of indexes, and avoiding OR, the overhead of the tmp table may outweigh the benefits for what seems to be a small resultset.
Gordon was right to use UNION ALL instead of the default UNION DISTINCT; the latter needs a de-dup pass, which is unnecessary for his query.
Bottom Line
- Shrink the table.
- Change the PK if possible; if not, add the suggested index.
- Upgrade to at least 5.6
- Use Gordon's second query.
Another solution
(I don't know if this is better, but it might be worth a try.)
SELECT carNumber
FROM ( SELECT DISTINCT carNumber
FROM event
WHERE sourceId = 3
AND createdOn >= '2016-09-20 20:24:00'
AND createdOn < '2016-09-20 20:45:00'
) AS x
WHERE EXISTS ( SELECT * FROM event
WHERE carNumber = x.carNumber
AND sourceId IN (26,44)
AND createdOn >= '2016-09-20 20:24:00'
AND createdOn < '2016-09-20 20:45:00'
);
It would need two indexes:
(sourceId, createdOn, carNumber) -- as before
(carNumber, sourceId, createdOn) -- to optimize the EXISTS
Source: https://stackoverflow.com/questions/39600514/get-the-cars-that-passed-specific-cameras