Question
I have a table with a schema like, say,
EventTime DATETIME(6),
EventType VARCHAR(20),
Number1 INT,
Number2 INT,
Number3 INT,
...
There are an unimaginably large number of rows in this table, but for the sake of this query I'm only interested in, say, a few thousand of them that are between two given values of EventTime. There's an index on EventTime, and if I just do something like
SELECT * FROM table WHERE EventTime >= time1 and EventTime <= time2;
then it's able to return the relevant rows near-instantaneously.
Out of the rows in this time window, I want to extract precisely those where Number1 is the largest for any row with that EventType. In other words, I want to do something equivalent to this query:
SELECT * FROM
(SELECT EventType, MAX(Number1) as max_Number1
FROM table
WHERE EventTime >= time1 AND EventTime <= time2
GROUP BY EventType) AS a
LEFT JOIN
(SELECT * FROM table
WHERE EventTime >= time1 AND EventTime <= time2) AS b
ON a.EventType = b.EventType AND a.max_Number1 = b.Number1;
This seems like it should work just fine -- I can run each of the subqueries, namely
SELECT EventType, MAX(Number1) as max_Number1
FROM table
WHERE EventTime >= time1 AND EventTime <= time2
GROUP BY EventType;
and
SELECT * FROM table
WHERE EventTime >= time1 AND EventTime <= time2;
virtually instantaneously, so at this point it shouldn't be too hard to produce the desired results: the database can sort or index the results of both subqueries by EventType and then just match things up.
However, when I actually run this it takes forever. I don't know how long, because I've never let it complete, but it takes way longer than it would for me to just manually pull the results of both queries and do the merge elsewhere.
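For reference, the "merge elsewhere" approach mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows in the time window have already been fetched into a list of dicts, and the function name rows_with_group_max and the sample data are hypothetical, not part of the original setup. Rows whose Number1 is NULL (None) are skipped, which mirrors how MAX() and an equality join would treat them.

```python
def rows_with_group_max(rows, group_key="EventType", value_key="Number1"):
    """Return every row whose value_key equals the maximum for its group_key."""
    # Pass 1: per-group maximum (the role of the GROUP BY subquery).
    best = {}
    for row in rows:
        group, value = row[group_key], row[value_key]
        if value is None:
            continue  # MAX() ignores NULLs
        if group not in best or value > best[group]:
            best[group] = value
    # Pass 2: keep the rows that match their group's maximum (the role of the join).
    return [
        row for row in rows
        if row[value_key] is not None and row[value_key] == best[row[group_key]]
    ]


# Hypothetical sample standing in for the few thousand rows in the time window.
window_rows = [
    {"EventTime": "2018-09-19 10:00:00", "EventType": "A", "Number1": 5},
    {"EventTime": "2018-09-19 10:00:01", "EventType": "A", "Number1": 9},
    {"EventTime": "2018-09-19 10:00:02", "EventType": "B", "Number1": 3},
    {"EventTime": "2018-09-19 10:00:03", "EventType": "A", "Number1": 9},
    {"EventTime": "2018-09-19 10:00:04", "EventType": "B", "Number1": None},
]

result = rows_with_group_max(window_rows)
```

Both passes are linear in the number of fetched rows, so for a few thousand rows this is effectively free. Ties are kept, so a group can contribute more than one row, just as the join would.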
Questions:
- Why is it taking so long? What is the database engine doing?
- Is there a way to write this as a query in such a way that it will perform reasonably?
- If not, can I write it as a stored procedure somehow?
Difficulty: As this table has tens of billions of rows it would be quite costly to add any further indices to it.
Answer 1:
You are actually already pretty close to a good query. The main drawback of yours is likely the LEFT JOIN against a subquery that selects every row from table in the time frame. Try the following:
SELECT * FROM
table b
INNER JOIN (
SELECT EventType, MAX(Number1) as max_Number1
FROM table
WHERE EventTime >= time1 AND EventTime <= time2
GROUP BY EventType
) AS a
ON a.EventType = b.EventType
AND a.max_Number1 = b.Number1
WHERE b.EventTime >= time1 AND b.EventTime <= time2
Ideally, this would be accompanied by an index on (EventType, EventTime). Please add the output of SHOW CREATE TABLE table to your question, so we can see what indexes you currently have. We may be able to tweak an existing one, or help you drop unneeded ones, to permit adding this new index.
Disclaimer: My experience is pretty exclusively in MySQL and InnoDB, but I think this should still be helpful for MariaDB and MyISAM.
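To check that the rewritten query returns the intended rows, here is a minimal sketch that runs it against a toy table using Python's built-in sqlite3 module. Several things here are assumptions made for illustration: SQLite stands in for MySQL/MariaDB, the table is named events (hypothetical, since table is a reserved word), timestamps are stored as text, and the sample rows are invented. It demonstrates correctness only and says nothing about how the real engine will plan the query on tens of billions of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " EventTime TEXT, EventType TEXT, Number1 INTEGER, Number2 INTEGER)"
)
# Mirrors the single index the question says already exists.
conn.execute("CREATE INDEX idx_eventtime ON events (EventTime)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2018-09-19 09:00:00", "A", 100, 0),  # before the window
        ("2018-09-19 10:00:00", "A", 5, 1),
        ("2018-09-19 10:00:01", "A", 9, 2),
        ("2018-09-19 10:00:02", "B", 3, 3),
        ("2018-09-19 10:00:03", "B", 1, 4),
        ("2018-09-19 11:00:00", "A", 9, 5),    # after the window, same value as A's max
    ],
)

QUERY = """
SELECT b.EventTime, b.EventType, b.Number1, b.Number2
FROM events AS b
INNER JOIN (
    SELECT EventType, MAX(Number1) AS max_Number1
    FROM events
    WHERE EventTime >= :t1 AND EventTime <= :t2
    GROUP BY EventType
) AS a
  ON a.EventType = b.EventType
 AND a.max_Number1 = b.Number1
WHERE b.EventTime >= :t1 AND b.EventTime <= :t2
ORDER BY b.EventType
"""

params = {"t1": "2018-09-19 10:00:00", "t2": "2018-09-19 10:00:59"}
top_rows = conn.execute(QUERY, params).fetchall()
```

The 11:00:00 row shows why the outer WHERE clause on b is needed: it has the same EventType and Number1 as the winning row for A, so without the time filter on b it would also be returned.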
Source: https://stackoverflow.com/questions/52414269/optimizing-query-to-get-entire-row-where-one-field-is-the-maximum-for-a-group