MySQL: Why 5th ID in the IN clause drastically changes query plan?

问题

Given the following two queries:

Query #1

SELECT log.id
FROM log
WHERE user_id IN
      (188858, 188886, 189854, 203623, 204072)
      and type in (14, 15, 17)
ORDER BY log.id DESC
LIMIT 25 OFFSET 0;

Query #2 - 4 IDs instead 5

SELECT log.id
FROM log
WHERE user_id IN
      (188858, 188886, 189854, 203623)
      and type in (14, 15, 17)
ORDER BY log.id DESC
LIMIT 25 OFFSET 0;

Explain Plan

-- Query #1
1   SIMPLE  log range   idx_user_id_and_log_id  idx_user_id_and_log_id  4       41280   Using index condition; Using where; Using filesort
-- Query #2
1   SIMPLE  log index   idx_user_id_and_log_id  PRIMARY                 4       53534   Using where

Why the addition of a single ID makes the execution plan so different? I'm talking about a difference in time of milliseconds to ~1 minute. I thought that it could be related to the eq_range_index_dive_limit parameters, but it's bellow 10 anyway (the default). I know that I can force the usage of the index instead of the clustered index, but I wanted to know why MySQL decided that.

Should I try to understand that? Or sometimes it's not possible to understand query planner decisions?

Extra Details

Table Size: 11GB
Rows: 108 Million
MySQL: 5.6.7
Doesn't matter which ID is removed from the IN clause.
The index: idx_user_id_and_log_id(user_id, id)

回答1:

As you have shown, MySQL has two alternative query plans for queries with ORDER BY ... LIMIT n:

Read all qualifying rows, sort them, and pick the n top rows.
Read the rows in sorted order and stop when n qualifying rows have been found.

In order to decide which is the better option, the optimizer needs to estimate the filtering effect of your WHERE condition. This is not straight-forward, especially for columns that are not indexed, or for columns where values are correlated. In your case, one probably has to read a lot more of the table in sorted order in order to find the first 25 qualifying rows than what the optimizer expected.

There have been several improvements in how LIMIT queries are handled, both in later releases of 5.6 (you are running on a pre-GA release!), and in newer releases (5.7, 8.0). I suggest you try to upgrade to a later release, and see if this still is an issue.

In general, if you want to understand query planner decisions, you should look at the optimizer trace for the query.

回答2:

JOIN is much more efficient.

Create a temporary table with the values of the IN operator. Then make a JOIN between table 'log' to the temporary table of values.

Refer to this answer for more info.

回答3:

Add

INDEX(user_id, type, id),
INDEX(type, user_id, id)

Each of these is a "covering" index. As such, the entire query can be performed by looking only in one index, without touching the 'data'.

I have two choices for the Optimizer -- hopefully it will be able to pick whether user_id IN (...) is more selective or type IN (...) in order to pick the better index.

If, after adding those, you don't have any use for idx_user_id_and_log_id(user_id, id), DROP it.

(No, I can't explain why query 2 chose to do a table scan.)

来源：https://stackoverflow.com/questions/51793915/mysql-why-5th-id-in-the-in-clause-drastically-changes-query-plan

标签

mysql

sql

indexing

query-performance