Optimizing query to get entire row where one field is the maximum for a group

Question

I have a table with a schema like, say,

EventTime   DATETIME(6),
EventType   VARCHAR(20),
Number1     INT,
Number2     INT,
Number3     INT,
...

There are an unimaginably large number of rows in this table, but for the sake of this query I'm only interested in, say, a few thousand of them that are between two given values of EventTime. There's an index on EventTime, and if I just do something like

SELECT * FROM table WHERE EventTime >= time1 and EventTime <= time2;

Then it's able to return the relevant rows near-instantaneously.

Out of the rows in this time window, I want to extract precisely those where Number1 is the largest for any row with that EventType. So in other words I want to do something equivalent to this query:

SELECT * FROM
  (SELECT EventType, MAX(Number1) as max_Number1
   FROM table
   WHERE EventTime >= time1 AND EventTime <= time2
   GROUP BY EventType) AS a
  LEFT JOIN
  (SELECT * FROM table
   WHERE EventTime >= time1 AND EventTime <= time2) AS b
  ON a.EventType = b.EventType AND a.max_Number1 = b.Number1;

This seems like it should work just fine -- I can run each of the subqueries, namely

SELECT EventType, MAX(Number1) as max_Number1
FROM table
WHERE EventTime >= time1 AND EventTime <= time2
GROUP BY EventType;

and

SELECT * FROM table
WHERE EventTime >= time1 AND EventTime <= time2;

virtually instantaneously, so at this point it shouldn't be too hard to produce the desired results: the database can sort or index the results of both subqueries by EventType and then just match things up.

However, when I actually run this it takes forever. I don't know how long, because I've never let it complete, but it takes way longer than it would for me to just manually pull the results of both queries and do the merge elsewhere.
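
The manual merge mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the actual setup: Python's built-in sqlite3 module with an in-memory database stands in for MariaDB, and the table name `events`, the sample rows, and the time bounds are all made up.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real MariaDB table,
# and "events" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (EventTime TEXT, EventType TEXT, Number1 INT, Number2 INT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2018-09-01 00:00:01", "A", 5, 10),
        ("2018-09-01 00:00:02", "A", 9, 20),
        ("2018-09-01 00:00:03", "B", 7, 30),
        ("2018-09-01 00:00:04", "B", 7, 40),   # tie on Number1 within type B
        ("2018-09-02 00:00:00", "A", 99, 50),  # outside the time window
    ],
)
time1, time2 = "2018-09-01 00:00:00", "2018-09-01 23:59:59"

# The two fast queries from the question, run separately.
maxima = dict(
    conn.execute(
        "SELECT EventType, MAX(Number1) FROM events "
        "WHERE EventTime >= ? AND EventTime <= ? GROUP BY EventType",
        (time1, time2),
    )
)
rows = conn.execute(
    "SELECT EventTime, EventType, Number1, Number2 FROM events "
    "WHERE EventTime >= ? AND EventTime <= ?",
    (time1, time2),
).fetchall()

# Merge on the client: keep each row whose Number1 equals its type's maximum.
merged = [row for row in rows if maxima[row[1]] == row[2]]
```

With only a few thousand rows in the window, the dictionary lookup is a single pass over the second result set. Ties on Number1 are kept, which matches what the join in the question would return.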

Questions:

Why is it taking so long? What is the database engine doing?
Is there a way to write this as a query in such a way that it will perform reasonably?
If not, can I write it as a stored procedure somehow?

Difficulty: As this table has tens of billions of rows it would be quite costly to add any further indices to it.

Answer 1:

You are actually already pretty close to a good query. The main drawback of yours is likely the LEFT JOIN against a second derived table that selects every row in the time frame. Try joining the base table directly instead:

SELECT * FROM
table b
INNER JOIN (
    SELECT EventType, MAX(Number1) as max_Number1
    FROM table
    WHERE EventTime >= time1 AND EventTime <= time2
    GROUP BY EventType
) AS a
ON a.EventType = b.EventType
AND a.max_Number1 = b.Number1
WHERE b.EventTime >= time1 AND b.EventTime <= time2

Ideally, this would be accompanied by an index on (EventType, EventTime). Please add the output of SHOW CREATE TABLE table to your question, so we can see what indexes you currently have. We may be able to tweak an existing one, or help you drop unneeded ones, to permit adding this new index.
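
To check what the rewritten query returns, here is a small sketch on toy data. It rests on several assumptions: Python's built-in sqlite3 module with an in-memory database stands in for MariaDB, and the table name `events`, the index name `idx_type_time`, and the sample rows are hypothetical. It says nothing about performance on the real table; it only illustrates the result set, including why the outer WHERE on b is still needed.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MariaDB; "events" is a made-up name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (EventTime TEXT, EventType TEXT, Number1 INT, Number2 INT)"
)
# The composite index suggested above (index name is hypothetical).
conn.execute("CREATE INDEX idx_type_time ON events (EventType, EventTime)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2018-09-01 00:00:01", "A", 5, 10),
        ("2018-09-01 00:00:02", "A", 9, 20),
        ("2018-09-01 00:00:03", "B", 7, 30),
        ("2018-09-01 00:00:04", "B", 7, 40),  # tie on Number1 within type B
        # Outside the window, but Number1 equals the in-window maximum for A:
        ("2018-09-02 00:00:00", "A", 9, 50),
    ],
)

query = """
SELECT b.EventTime, b.EventType, b.Number1, b.Number2
FROM events b
INNER JOIN (
    SELECT EventType, MAX(Number1) AS max_Number1
    FROM events
    WHERE EventTime >= :t1 AND EventTime <= :t2
    GROUP BY EventType
) AS a
ON a.EventType = b.EventType
AND a.max_Number1 = b.Number1
WHERE b.EventTime >= :t1 AND b.EventTime <= :t2
ORDER BY b.EventType, b.EventTime
"""
window = {"t1": "2018-09-01 00:00:00", "t2": "2018-09-01 23:59:59"}
result = conn.execute(query, window).fetchall()

for row in result:
    print(row)
```

The row dated 2018-09-02 matches the join condition for type A but is dropped by the outer WHERE, so the time filter has to appear both inside the derived table and on b. On the real server, prefixing the statement with EXPLAIN should show which indexes the optimizer picks.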

Disclaimer: My experience is pretty exclusively in MySQL and InnoDB, but I think this should still be helpful for MariaDB and MyISAM.

Source: https://stackoverflow.com/questions/52414269/optimizing-query-to-get-entire-row-where-one-field-is-the-maximum-for-a-group

Tags

mariadb

query-optimization

greatest-n-per-group

MyISAM