Correct indexing for JOIN / WHERE / GROUP BY SELECT queries to avoid "Using temporary; Using filesort"

Question

I've searched a lot for a solution to the case described below, but unfortunately I haven't found a similar one.

I have the following scenario (as a new user I couldn't post my picture, but I can send it by mail; below is a textual representation of it):

Table 1 "swap_plan"          Table 2 "cell"
ClusterName | SiteID         SiteID    | Cell      |  Time       | Counter
-----------------------      ---------------------------------------------
Cluster A   | SiteID A1      SiteID A1 | Cell A1-1 | day1        | 5
Cluster A   | SiteID A2      SiteID A1 | Cell A1-1 | day2        | 3
Cluster A   | SiteID A3      SiteID A1 | Cell A1-1 | day3        | 6
Cluster A   | SiteID A4      SiteID A1 | Cell A1-2 | day1        | 6
Cluster A   | SiteID A5      SiteID A1 | Cell A1-2 | day2        | 2
Cluster A   | SiteID A6      SiteID A1 | Cell A1-2 | day3        | 9
.......................      ..............................................
Cluster B   | .........      ..............................................

(Where No 1)      (ON Clause "SiteID")            (Where No 2)    Sum(Counter)

I have to display some performance indicators ("Counter" from table 2 "cell"), aggregated over time ("Time" from table 2 "cell") and cluster ("ClusterName" from table 1 "swap_plan").

The join is done via the column "SiteID", which is common to both tables. Please note that in Table 2 "cell" each SiteID consists of 3 different objects ("Cell"). So in fact I compute SUM() of "Counter" across the cells.

The query is as follows:

SELECT ClusterName,Time,SUM(counter)
FROM cell
INNER JOIN swap_plan ON swap_plan.Siteid = cell.Siteid
WHERE ClusterName='Cluster A' AND Time>=day1 AND Time<=day2
GROUP BY Time

The column types are as follows:

Table 1 "swap_plan":

ClusterName - CHAR(30)
SiteID - VARCHAR(10)

Table 2 "cell":

SiteID - VARCHAR(10)
Time - DATETIME
Counter - INT

EXPLAIN showed the following:

table          type    key           key_len      ref               rows  Extra

swap_plan      ref     Index 1       30           const             31    Using where; Using index; Using temporary; Using filesort
cell           ref     Index_siteid  13           swap_plan.SiteID  368   Using where

The indexes used are as follows:

swap_plan: Index 1 (1. ClusterName and 2. SiteID)

cell: Index_siteid (SiteID)

The number of rows the optimizer examines is rather low, which is good:

swap_plan: 31 out of 6066 and cell: 368 out of 6.6 mil.

My problem is the "Using temporary; Using filesort" in the Extra column. As far as I understand, it comes from the sorting needed for GROUP BY (if I remove the GROUP BY, EXPLAIN no longer shows these operations). I found that to avoid them you need an index on the columns you group by. I have a separate index on the "Time" column alone, but it is not used, even with the hint "USE INDEX FOR GROUP BY ()".

As a result, my query is not fast enough: it takes about 15 seconds (for, say, 15 SiteIDs and 10 dates), and I need to cut that at least in half.

My main questions are:

Is it possible at all to remove "Using temporary; Using filesort", or to reduce the time they take? (I tried increasing the read buffer size to 16MB, without effect.)
What kind of index definitions do I need in JOIN situations where the WHERE clause filters on two columns in different tables and the ON clause joins on a third column?
What kind of GROUP BY optimization can I apply (indexing, etc.)?

Thank you very much in advance!

Answer 1:

I'd write the query like this:

SELECT c.time
     , SUM(c.counter)
     , MAX(p.clustername) AS clustername
  FROM cell c

  JOIN swap_plan p
    ON p.siteid      = c.siteid
   AND p.clustername = 'Cluster A'

 WHERE c.time  >=  'day1'
   AND c.time  <=  'day2'
 GROUP
    BY c.time

I'd be sure to have an index on cell with time as the leading column.

MySQL can use the same index to satisfy the range predicate (in the WHERE clause), and to satisfy the GROUP BY without a "Using filesort" operation.

... ON cell (time)

Depending on the sizes of the columns, a covering index might give optimal performance. A covering index includes all of the columns from the table that are referenced in the query, so the query can be satisfied entirely from index pages without lookup to pages in the underlying table.

... ON cell (time, siteid, counter)
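
Written out in full, the two cell index options could look something like the sketch below. The index names cell_ix_time and cell_ix_time_siteid_counter are made up for illustration; the columns are the ones from the question. Only one of the two should be needed, since the covering index also has time as its leading column.

```sql
-- Hypothetical index names; the columns come from the question's "cell" table.

-- Option 1: time as the leading column, usable for both the range
-- predicate in the WHERE clause and the GROUP BY.
CREATE INDEX cell_ix_time
    ON cell (time);

-- Option 2: covering index. It contains every column of "cell" that the
-- query references, so MySQL can read the index without visiting table rows.
CREATE INDEX cell_ix_time_siteid_counter
    ON cell (time, siteid, counter);
```

The covering index is wider, so it costs more disk space and slows inserts a little; whether that trade is worth it depends on how the table is loaded.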

For swap_plan, I'd have an index on both siteid and clustername, either of:

... ON swap_plan (clustername, siteid)

... ON swap_plan (siteid, clustername)
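
As a sketch, the two swap_plan alternatives could be spelled out like this. The index names are hypothetical, and the column is written siteid to match the question's SiteID column. The first form has the same column order as the "Index 1" the question already describes, so it may not need to be created again.

```sql
-- Hypothetical index names; pick one of the two column orders.

-- Same column order as the existing "Index 1" from the question.
CREATE INDEX swap_plan_ix_cluster_site
    ON swap_plan (clustername, siteid);

-- Alternative order with siteid leading.
CREATE INDEX swap_plan_ix_site_cluster
    ON swap_plan (siteid, clustername);
```

Because the join supplies an equality on siteid and the literal supplies an equality on clustername, either column order lets MySQL look up the matching row with both key parts.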

It looks likely that there is going to be a UNIQUE constraint on the combination of those two columns, i.e. the values of siteid will be distinct for a given clustername. (If that isn't the case, and the same (siteid, clustername) tuple appears multiple times, there's potential for the aggregate total of counter to be inflated.)
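
One way to check that assumption, sketched here with the table and column names from the question and a made-up constraint name, is to look for repeated pairs first and only then enforce uniqueness:

```sql
-- List every (siteid, clustername) pair that occurs more than once.
-- Each extra copy would make the join return the matching cell rows
-- again, inflating SUM(counter).
SELECT siteid
     , clustername
     , COUNT(*) AS copies
  FROM swap_plan
 GROUP BY siteid, clustername
HAVING COUNT(*) > 1;

-- If the query above returns no rows, uniqueness can be enforced.
-- The key name swap_plan_ux_site_cluster is hypothetical.
ALTER TABLE swap_plan
  ADD UNIQUE KEY swap_plan_ux_site_cluster (siteid, clustername);
```

A unique key is itself an index on (siteid, clustername), so it can also serve as the swap_plan index for the join.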

I'd be looking for the EXPLAIN output to show a 'ref' lookup to the swap_plan table, from the value of c.siteid and a const (the literal 'Cluster A') value for clustername.
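
To see which plan MySQL actually picks, the rewritten query can be prefixed with EXPLAIN, roughly as below. The two datetime literals are placeholder values standing in for the question's day1 and day2; they are assumptions, not values from the original post.

```sql
-- The datetime literals are placeholders for "day1" and "day2".
EXPLAIN
SELECT c.time
     , SUM(c.counter)
     , MAX(p.clustername) AS clustername
  FROM cell c
  JOIN swap_plan p
    ON p.siteid      = c.siteid
   AND p.clustername = 'Cluster A'
 WHERE c.time >= '2013-01-01 00:00:00'
   AND c.time <= '2013-01-02 00:00:00'
 GROUP BY c.time;
```

The thing to look for is cell listed first with a range access on the time index, swap_plan second with a ref access, and no "Using temporary; Using filesort" in the Extra column. The optimizer may still choose a different join order depending on its statistics.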

Note that 31 and 368 are the optimizer's row estimates from EXPLAIN, not the table sizes (the question gives 6066 rows for swap_plan and 6.6 million for cell). With tables that really were sized at 31 rows and 368 rows, we wouldn't see a significant difference in performance (elapsed time) between an optimal execution plan and a horrible execution plan.

When either of the tables scales up to millions of rows, that's when the differences become apparent. The optimizer's choice of execution plan is influenced by statistics (size, number of rows, column cardinality) of each table, so the execution plan could change with an increase in table sizes.

Source: https://stackoverflow.com/questions/14458367/correct-indexing-by-join-where-group-by-select-queries-avoiding-using-temporary

Tags

mysql

join

group-by