Question
I have searched extensively for a solution to the case described below, but unfortunately I have not found a similar one.
I have the following scenario (as a new user I could not post my picture, but I can send it by mail; below is a textual representation of it):
Table 1 "swap_plan"
ClusterName | SiteID
-----------------------
Cluster A | SiteID A1
Cluster A | SiteID A2
Cluster A | SiteID A3
Cluster A | SiteID A4
Cluster A | SiteID A5
Cluster A | SiteID A6
.......................
Cluster B | .........
(ClusterName: Where No 1; SiteID: ON clause)

Table 2 "cell"
SiteID | Cell | Time | Counter
---------------------------------------------
SiteID A1 | Cell A1-1 | day1 | 5
SiteID A1 | Cell A1-1 | day2 | 3
SiteID A1 | Cell A1-1 | day3 | 6
SiteID A1 | Cell A1-2 | day1 | 6
SiteID A1 | Cell A1-2 | day2 | 2
SiteID A1 | Cell A1-2 | day3 | 9
..............................................
(SiteID: ON clause; Time: Where No 2; Counter: Sum(Counter))
I have to display some performance indicators ("Counter" from table 2 "cell"), aggregated over time ("Time" from table 2 "cell") and cluster ("ClusterName" from table 1 "swap_plan").
The join is done via the column "SiteID", which is common to both tables. Please note that in Table 2 "cell" each SiteID consists of 3 different objects ("Cell"). So, in fact, I do a SUM() of "Counter" for each Cell.
The query is as follows:
SELECT ClusterName,Time,SUM(counter)
FROM cell
INNER JOIN swap_plan ON swap_plan.Siteid = cell.Siteid
WHERE ClusterName='Cluster A' AND Time>=day1 AND Time<=day2
GROUP BY Time
The column types are as follows:
Table 1 "swap_plan":
- ClusterName - CHAR(30)
- SiteID - VARCHAR(10)
Table 2 "cell":
- SiteID - VARCHAR(10)
- Time - DATETIME
- Counter - INT
EXPLAIN showed the following:
table | type | key | key_len | ref | rows | Extra
swap_plan | ref | Index 1 | 30 | const | 31 | Using where; Using index; Using temporary; Using filesort
cell | ref | Index_siteid | 13 | swap_plan.SiteID | 368 | Using where
The indexes used are as follows:
swap_plan: Index 1 (1. ClusterName and 2. SiteID)
cell: Index_siteid (SiteID)
The number of rows the optimizer examines is rather low, which is good:
swap_plan: 31 out of 6066 and cell: 368 out of 6.6 million.
My problem is the "Using temporary; Using filesort". As far as I understand, this comes from the sorting needed for the GROUP BY (if I remove it, these operations no longer appear in EXPLAIN). I found that, in order to avoid them, you need an index on the columns you group by. I have a dedicated index containing only the "Time" column, but it is not used, even with the hint "USE INDEX FOR GROUP BY ()".
As a result my query is not fast enough: it takes about 15 seconds (for, let's say, 15 SiteIDs and 10 dates) and I need to cut this to at most half.
My main questions are:
- Is it possible at all to remove "Using temporary; Using filesort", or to reduce the time they take? (I tried increasing the read buffer size to 16MB, without effect.)
- What kind of index definitions do I need in JOIN situations where the WHERE clause filters on 2 columns in different tables and the ON clause joins on a 3rd column?
- What kind of GROUP BY optimization can I apply (indexing, etc.)?
Thank you very much in advance!
Answer 1:
I'd write the query like this:
SELECT c.time
, SUM(c.counter)
, MAX(p.clustername) AS clustername
FROM cell c
JOIN swap_plan p
ON p.siteid = c.siteid
AND p.clustername = 'Cluster A'
WHERE c.time >= 'day1'
AND c.time <= 'day2'
GROUP
BY c.time
I'd be sure to have an index on cell with time as the leading column.
MySQL can use the same index to satisfy the range predicate (in the WHERE clause), and to satisfy the GROUP BY without a "Using filesort" operation.
... ON cell (time)
Depending on the sizes of the columns, a covering index might give optimal performance. A covering index includes all of the columns from the table that are referenced in the query, so the query can be satisfied entirely from index pages without lookup to pages in the underlying table.
... ON cell (time, siteid, counter)
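The two index fragments above could be spelled out as complete statements roughly as follows. This is a minimal sketch in MySQL syntax; the index names cell_ix_time and cell_ix_time_siteid_counter are hypothetical and not from the original post, and normally only one of the two indexes would be needed.

```sql
-- Option 1: time as the only (leading) column.
-- Index name is an assumption chosen for illustration.
CREATE INDEX cell_ix_time
    ON cell (time);

-- Option 2: covering index for this query. time leads (range predicate
-- and GROUP BY), siteid serves the join, counter serves the SUM().
-- Index name is an assumption chosen for illustration.
CREATE INDEX cell_ix_time_siteid_counter
    ON cell (time, siteid, counter);
```

If the covering index is created, the narrower index on (time) alone is redundant for this query, since both start with the same leading column.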
For swap_plan, I'd have an index that contains both the siteid and clustername columns, either of:
... ON swap_plan (clustername, siteid)
or
... ON swap_plan (siteid, clustername)
It looks likely that there is going to be a UNIQUE constraint on the combination of those two columns, i.e. the values of siteid will be distinct for a given clustername. (If that isn't the case, and the same (siteid, clustername) tuple appears multiple times, there's potential for the aggregate total of counter to be inflated.)
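One way to check the uniqueness assumption before relying on it is sketched below. The alias copies and the index name swap_plan_ux_siteid_clustername are hypothetical, and the column is written siteid as in the query; the CREATE UNIQUE INDEX statement will fail if the first query returns any rows.

```sql
-- List any (siteid, clustername) pairs that occur more than once.
-- Each such pair would multiply the matching cell rows in the join
-- and inflate SUM(counter).
SELECT siteid
     , clustername
     , COUNT(*) AS copies
  FROM swap_plan
 GROUP
    BY siteid
     , clustername
HAVING COUNT(*) > 1;

-- If no duplicates are reported, the index can be declared UNIQUE.
-- Index name is an assumption chosen for illustration.
CREATE UNIQUE INDEX swap_plan_ux_siteid_clustername
    ON swap_plan (siteid, clustername);
```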
I'd be looking for the EXPLAIN output to show a 'ref' lookup to the swap_plan table from the value of c.siteid and a const (literal 'Cluster A') value for clustername.
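To inspect the plan, the rewritten query can be prefixed with EXPLAIN, roughly as below. The two datetime literals are placeholders I made up to stand in for day1 and day2; they are not values from the original post.

```sql
-- Placeholder datetime literals stand in for day1 and day2.
EXPLAIN
SELECT c.time
     , SUM(c.counter)
     , MAX(p.clustername) AS clustername
  FROM cell c
  JOIN swap_plan p
    ON p.siteid = c.siteid
   AND p.clustername = 'Cluster A'
 WHERE c.time >= '2013-01-01 00:00:00'
   AND c.time <= '2013-01-02 00:00:00'
 GROUP
    BY c.time;
```

In the output, the row for cell would ideally show a range access on the index that leads with time, and the Extra column would no longer list "Using temporary; Using filesort".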
With tables sized at 31 rows and 368 rows, we aren't going to see a significant difference in performance (elapsed time) between an optimal execution plan and a horrible execution plan.
When either of the tables scales up to millions of rows, that's when the differences will become apparent. The optimizer's choice of execution plan is influenced by the statistics (size, number of rows, column cardinality) of each table, so the execution plan could change as the tables grow.
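Since the plan depends on those statistics, it may be worth refreshing and inspecting them before comparing EXPLAIN output. A small sketch using standard MySQL statements (nothing here is from the original post):

```sql
-- Recompute the key distribution statistics for both tables.
ANALYZE TABLE cell, swap_plan;

-- Show the indexes on each table; the Cardinality column is the
-- optimizer's estimate of the number of distinct values per index part.
SHOW INDEX FROM cell;
SHOW INDEX FROM swap_plan;
```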
Source: https://stackoverflow.com/questions/14458367/correct-indexing-by-join-where-group-by-select-queries-avoiding-using-temporary