Get the cars that passed specific cameras

Submitted by 大憨熊 on 2019-12-25 08:27:12

Question


MySQL/MariaDB schema and sample data:

CREATE DATABASE IF NOT EXISTS `puzzle` DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_unicode_ci;

USE `puzzle`;

DROP TABLE IF EXISTS `event`;

CREATE TABLE `event` (
  `eventId` bigint(20) NOT NULL AUTO_INCREMENT,
  `sourceId` bigint(20) NOT NULL COMMENT 'think of source as camera',
  `carNumber` varchar(40) NOT NULL COMMENT 'ex: 5849',
  `createdOn` datetime DEFAULT NULL,
  PRIMARY KEY (`eventId`)
) ENGINE=INNODB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;


INSERT INTO `event` (`eventId`, `sourceId`, `carNumber`, `createdOn`) VALUES
    (1, 44, '4456', '2016-09-20 20:24:05'),
    (2, 26, '26484', '2016-09-20 20:24:05'),
    (3, 5, '4456', '2016-09-20 20:24:06'),
    (4, 3, '72704', '2016-09-20 20:24:15'),
    (5, 3, '399606', '2016-09-20 20:26:15'),
    (6, 5, '4456', '2016-09-20 20:27:25'),
    (7, 44, '72704', '2016-09-20 20:29:25'),
    (8, 3, '4456', '2016-09-20 20:30:55'),
    (9, 44, '26484', '2016-09-20 20:34:55'),
    (10, 26, '4456', '2016-09-20 20:35:15'),
    (11, 3, '72704', '2016-09-20 20:35:15'),
    (12, 3, '399606', '2016-09-20 20:44:35'),
    (13, 26, '4456', '2016-09-20 20:49:45');

I want to get the carNumber(s) that were recorded by sourceId = 3 AND by sourceId 26 OR 44 between 20:24 and 20:45. The query needs to be fast, since the real table contains over 300 million records.

So far, the query below is as far as I have got (and it does not even produce valid results):

select * from event e where 
e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00' 
and e.sourceId IN(3,26,44) group by e.carNumber;

the correct results for the provided data:

carNumber
4456
72704

I am really puzzled and stuck. I tried EXISTS, joins, and subqueries without luck, so I wonder whether SQL can solve this, or whether I should do it in backend code.

MySQL / MariaDB version in use:

mariadb-5.5.50

mysql-5.5.51


Answer 1:


If you need this to be fast, then the following might work, assuming you have an index on event(createdOn, carNumber, sourceId):

select e.carNumber 
from event e 
where e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
group by e.carNumber
having sum(e.sourceId = 3) > 0 and
       sum(e.sourceId IN (26, 44)) > 0;

I would be inclined to change this to:

select e.carNumber 
from event e 
where e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00' and
      e.sourceId in (3, 26, 44)
group by e.carNumber
having sum(e.sourceId = 3) > 0 and
       sum(e.sourceId IN (26, 44)) > 0;

And then for performance, even this:

select carNumber
from ((select carNumber, sourceId
       from event e
       where e.sourceId = 3 and
             e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
      ) union all
      (select carNumber, sourceId
       from event e
       where e.sourceId = 26 and
             e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
      ) union all
      (select carNumber, sourceId
       from event e
       where e.sourceId = 44 and
             e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
      )
     ) e
group by e.carNumber
having sum(e.sourceId = 3) > 0 and
       sum(e.sourceId IN (26, 44)) > 0;

This version can take advantage of an index on event(sourceId, createdOn, carNumber). Each subquery should use this index very effectively, bringing a smallish amount of data together for the final aggregation.




Answer 2:


You can use the HAVING clause to filter on groups. Use sum() to count how many times a certain condition holds within each group of rows: in MySQL a comparison evaluates to 1 or 0, so summing it counts the matching rows.

select e.carNumber 
from event e 
where e.createdOn > '2016-09-20 20:24:00' 
  and e.createdOn < '2016-09-20 20:45:00'
group by e.carNumber
having sum(e.sourceId = 3) > 0
   and sum(e.sourceId IN (26,44)) > 0
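
To see why the HAVING conditions work, it can help to select the two sums instead of filtering on them. The following is only an illustrative sketch against the sample data from the question; the column aliases hits_3 and hits_26_44 are made up for this example.

```sql
-- Show the per-car counts that the HAVING clause tests.
-- In MySQL/MariaDB a comparison yields 1 (true) or 0 (false),
-- so SUM(condition) counts the rows in the group that satisfy it.
SELECT e.carNumber,
       SUM(e.sourceId = 3)          AS hits_3,      -- hypothetical alias
       SUM(e.sourceId IN (26, 44))  AS hits_26_44   -- hypothetical alias
FROM event e
WHERE e.createdOn > '2016-09-20 20:24:00'
  AND e.createdOn < '2016-09-20 20:45:00'
GROUP BY e.carNumber
ORDER BY e.carNumber;

-- With the sample rows from the question this returns:
--   carNumber | hits_3 | hits_26_44
--   26484     |   0    |    2
--   399606    |   2    |    0
--   4456      |   1    |    2
--   72704     |   2    |    1
```

Only 4456 and 72704 have both counts above zero, which matches the expected result in the question.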



Answer 3:


Something like the following should do the trick for you:

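 SELECT carNumber
 FROM event
 WHERE sourceId = 3
     AND createdOn > '2016-09-20 20:24:00' AND createdOn < '2016-09-20 20:45:00'
     AND carNumber IN (SELECT carNumber FROM event
                       WHERE sourceId IN (26,44)
                         AND createdOn > '2016-09-20 20:24:00' AND createdOn < '2016-09-20 20:45:00')
 GROUP BY carNumber;

That WHERE clause looks for records with a sourceId of 3 inside the time window, and then also makes sure that the carNumber has at least one other record in the same window where the sourceId is either 26 or 44.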

Don't code anything outside of SQL for this one since this is definitely a problem that SQL is built to solve as quickly as possible.




Answer 4:


Shrink the table size

With 300M rows, you should really use the smallest datatypes that are practical.

  • BIGINT takes 8 bytes; INT UNSIGNED (only 4 bytes) is usually sufficient (max of 4 billion). If fewer than 65K cameras, use a 2-byte SMALLINT UNSIGNED.

  • carNumber looks like a number, so why use VARCHAR? The examples you have take 5-7 bytes as VARCHAR; they would fit in 4 bytes with INT UNSIGNED or 3 bytes with MEDIUMINT UNSIGNED (max of 16M).

Shrinking the table will help any solution chosen.
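
As a rough sketch of what the datatype changes could look like: this assumes every carNumber is purely numeric, has no significant leading zeros, and fits in INT UNSIGNED, and that there are fewer than 65K cameras. None of that is confirmed by the question, so check first. On MySQL 5.5 an ALTER like this copies the whole table, so try it on a copy before touching 300M rows.

```sql
-- Assumption check: count rows whose carNumber is not all digits.
-- If this is non-zero, carNumber cannot simply become an integer.
SELECT COUNT(*) AS non_numeric
FROM `event`
WHERE carNumber NOT REGEXP '^[0-9]+$';

-- Only if the checks pass: shrink the columns in a single ALTER
-- so the table is rebuilt once rather than three times.
ALTER TABLE `event`
    MODIFY `eventId`   INT UNSIGNED      NOT NULL AUTO_INCREMENT,
    MODIFY `sourceId`  SMALLINT UNSIGNED NOT NULL COMMENT 'think of source as camera',
    MODIFY `carNumber` INT UNSIGNED      NOT NULL COMMENT 'ex: 5849';
```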

Covering index

This has already been suggested in other answers, but I want to make it clear why it helps. If all the columns a query needs exist in a single index, the query can be performed in the index's BTree, without touching the data. This is usually faster because the index is smaller than the table. A 'covering' index for this query has sourceId, carNumber, createdOn in any order.

Order of columns in index

Since an index can only be used left-to-right, the order of its columns is important. (This does not apply to Gordon's first select, which needs createdOn first.)

  1. sourceId is handled with = or IN, so it should come first. In the case of IN, you probably need 5.6 or later to get the IN optimizations.
  2. createdOn is a range, so the lookup will stop with it.
  3. For "covering", now any extra columns can be added on. In this case, carNumber.

So, most (not all) suggestions want this order: INDEX(sourceId, createdOn, carNumber).
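
A minimal sketch of adding that index and checking whether the optimizer picks it up. The index name idx_source_created_car is made up for this example, and the query is Gordon's second one.

```sql
-- Equality/IN column first, range column second, then the extra
-- column that makes the index covering for this query.
ALTER TABLE `event`
    ADD INDEX `idx_source_created_car` (`sourceId`, `createdOn`, `carNumber`);

-- Inspect the plan: the `key` column should name the new index, and
-- "Using index" in `Extra` indicates the query is answered from the
-- index alone, without reading the table rows.
EXPLAIN
SELECT e.carNumber
FROM event e
WHERE e.createdOn > '2016-09-20 20:24:00' AND e.createdOn < '2016-09-20 20:45:00'
  AND e.sourceId IN (3, 26, 44)
GROUP BY e.carNumber
HAVING SUM(e.sourceId = 3) > 0
   AND SUM(e.sourceId IN (26, 44)) > 0;
```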

Get rid of auto_increment

Do you use eventID in other tables? If so, then you should probably keep it. If not, then is the combo (sourceId, createdOn, carNumber) unique? If so, then make that the PRIMARY KEY. Surrogate PK is nice for some situations, but it hinders performance in others. I am suggesting that it may be a hindrance here.
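
If you want to try this, the following sketch shows one way it might be done. It assumes eventId is not referenced anywhere else and that createdOn never contains NULL (a primary key column must be NOT NULL, and the original schema allows NULL there). Run the uniqueness check first and test on a copy of the table.

```sql
-- Assumption check: this must return no rows, otherwise the
-- combination cannot serve as a primary key.
SELECT sourceId, createdOn, carNumber, COUNT(*) AS n
FROM `event`
GROUP BY sourceId, createdOn, carNumber
HAVING n > 1
LIMIT 1;

-- Replace the surrogate key with the natural key in one ALTER.
-- InnoDB clusters the rows on the primary key, so rows for one
-- camera and time range end up stored next to each other.
ALTER TABLE `event`
    DROP PRIMARY KEY,
    DROP COLUMN `eventId`,
    MODIFY `createdOn` DATETIME NOT NULL,
    ADD PRIMARY KEY (`sourceId`, `createdOn`, `carNumber`);
```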

Avoid slow operations

UNION usually involves a temp table; this adds overhead. While UNION is beneficial in making better use of indexes, and avoiding OR, the overhead of the tmp table may outweigh the benefits for what seems to be a small resultset.

Gordon was right to use UNION ALL instead of the default UNION DISTINCT; the latter needs a de-dup pass, which is unnecessary for his query.

Bottom Line

  1. Shrink the table.
  2. Change the PK if possible; if not, add the suggested index.
  3. Upgrade to at least 5.6
  4. Use Gordon's second query.

Another solution

(I don't know if this is better, but it might be worth a try.)

SELECT carNumber 
    FROM ( SELECT DISTINCT carNumber
           FROM event
           WHERE sourceId = 3
             AND createdOn >= '2016-09-20 20:24:00'
             AND createdOn  < '2016-09-20 20:45:00'
         ) AS x
    WHERE EXISTS ( SELECT * FROM event
            WHERE carNumber = x.carNumber
              AND sourceId IN (26,44)
              AND createdOn >= '2016-09-20 20:24:00'
              AND createdOn  < '2016-09-20 20:45:00'
                 );

It would need two indexes:

(sourceId, createdOn, carNumber)  -- as before
(carNumber, sourceId, createdOn)  -- to optimize the EXISTS
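
Written out as DDL, the second index might look like the sketch below; the first is the (sourceId, createdOn, carNumber) index already recommended earlier in this answer. The index name idx_car_source_created is made up for this example.

```sql
-- Lets the EXISTS subquery look up a single carNumber directly,
-- then narrow by sourceId and the createdOn range inside the index.
ALTER TABLE `event`
    ADD INDEX `idx_car_source_created` (`carNumber`, `sourceId`, `createdOn`);
```

Each extra index on a table this size costs disk space and slows inserts, so it is worth confirming with EXPLAIN that the EXISTS query actually uses it before keeping it.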


Source: https://stackoverflow.com/questions/39600514/get-the-cars-that-passed-specific-cameras
