Why is an index used only when forced, but not by default?

Question

I have around 420 million records in my table. There is only one index, on column colC of user_table. The query below returns around 1.5 million records based on colC, but somehow the index is not used and the query takes 20 to 25 minutes to return the records.

select colA ,ColB , count(*) as count 
from user_table 
where colC >='2019-09-01 00:00:00' 
      and colC<'2019-09-30 23:59:59' 
      and colA in ("some static value") 
      and ColB in (17) 
group by colA ,ColB;

But when I force the index, it gets used and the query returns the records in only 2 minutes. My question is: why is MySQL not using the index by default when the fetch time is so much shorter with it? I have recreated the index and run a repair, but nothing makes MySQL use it by default.

Another observation, for information: the same query (without force index) works for previous months, which have the same volume of data.

Update: the details asked for by Evert

CREATE TABLE USER_TABLE (
  id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  COLA varchar(10) DEFAULT NULL,
  COLB int(11) DEFAULT NULL,
  COLC datetime DEFAULT NULL,
  ....
  PRIMARY KEY (id),
  KEY colA (COLA),
  KEY colB (COLB),
  KEY colC (COLC)
) ENGINE=MyISAM AUTO_INCREMENT=2328036072 DEFAULT CHARSET=latin1
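
For reference, the comparison described in the question could be reproduced roughly as follows. This is only a sketch: it assumes the index on COLC is named colC, as the table definition suggests, and it uses EXPLAIN so that neither statement actually scans the table.

```sql
-- Plan the optimizer picks on its own.
EXPLAIN
SELECT colA, ColB, COUNT(*) AS count
FROM user_table
WHERE colC >= '2019-09-01 00:00:00'
  AND colC <  '2019-09-30 23:59:59'
  AND colA IN ("some static value")
  AND ColB IN (17)
GROUP BY colA, ColB;

-- Same query with the index hint (index name colC is an assumption).
EXPLAIN
SELECT colA, ColB, COUNT(*) AS count
FROM user_table FORCE INDEX (colC)
WHERE colC >= '2019-09-01 00:00:00'
  AND colC <  '2019-09-30 23:59:59'
  AND colA IN ("some static value")
  AND ColB IN (17)
GROUP BY colA, ColB;
```

Comparing the possible_keys, key and rows columns of the two plans shows which index the optimizer preferred and how many rows it expected to read in each case.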

Answer 1:

For better performance you could try a composite index based on the columns involved in your where clause,
and try changing the IN clause into an inner join.
Assuming the content of your IN clause is a set of fixed values, you could use a union (or a new table holding the values you need).

For example, using the union (you can do something similar if the IN clause is a subquery):

select user_table.colA, user_table.ColB, count(*) as count
from user_table
INNER JOIN (
  select 'FIXED1' colA
  union
  select 'FIXED2'
  ....
  union
  select 'FIXEDX'
  ) t on t.colA = user_table.colA
where user_table.colC >= '2019-09-01 00:00:00'
      and user_table.colC < '2019-09-30 23:59:59'
      and user_table.ColB = 17
group by user_table.colA, user_table.ColB;

You could also add a composite index on table user_table on the columns

   colA, colB, colC
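
A minimal sketch of that suggestion follows. The index name idx_cola_colb_colc is made up for illustration; any name works.

```sql
-- Composite index: the two equality columns first, the range column (colC) last.
-- The index name is hypothetical.
ALTER TABLE user_table
  ADD INDEX idx_cola_colb_colc (colA, colB, colC);
```

On a table with around 420 million rows, building the index can be expected to take a considerable amount of time, so it is probably best tried on a copy or during a quiet period.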

As for what the MySQL query optimizer considers when deciding which index to use, there are several aspects, and the optimizer assigns a cost to each of them.
The ones you should take into consideration:

The columns involved in the where clause
The size of the tables (and, though not in your case, the size of the tables in a join)
An estimate of how many rows will be fetched (to decide whether to use an index or simply scan the table)
Whether or not the data types match between the columns in the join and where clauses
The use of functions or data type conversions, including a collation mismatch
The size of the index
The cardinality of the index

A cost is evaluated for each of these options, and this leads to the choice of index.
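
Some of these inputs can be inspected directly. The statements below are standard MySQL, shown here only as a sketch of where to look. They are not a diagnosis of this particular table.

```sql
-- Index sizes and the Cardinality estimates the optimizer works from.
SHOW INDEX FROM user_table;

-- Refresh the key distribution statistics; this may take a while on a table this size.
ANALYZE TABLE user_table;

-- Show the estimated row count and the index chosen for the query.
EXPLAIN
SELECT colA, ColB, COUNT(*) AS count
FROM user_table
WHERE colC >= '2019-09-01 00:00:00'
  AND colC <  '2019-09-30 23:59:59'
  AND colA IN ("some static value")
  AND ColB IN (17)
GROUP BY colA, ColB;
```

If the rows estimate in the EXPLAIN output is far from the roughly 1.5 million rows the date range actually matches, stale or coarse statistics are one plausible reason for the optimizer's choice.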

In your case, colC is a date compared against literal values written as strings, which could imply a data conversion, and this may be why the index is not chosen.

This is also why I suggested a composite index whose leftmost columns are the ones that need no conversion.
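
One way to test the conversion hypothesis is to write the bounds as typed literals instead of plain strings and compare the plans. This is a sketch only; the TIMESTAMP 'string' form is standard SQL literal syntax that MySQL accepts.

```sql
-- Same predicate, but with temporal literals rather than strings.
EXPLAIN
SELECT colA, ColB, COUNT(*) AS count
FROM user_table
WHERE colC >= TIMESTAMP '2019-09-01 00:00:00'
  AND colC <  TIMESTAMP '2019-09-30 23:59:59'
  AND colA IN ("some static value")
  AND ColB IN (17)
GROUP BY colA, ColB;
```

If the plan is the same as with the string literals, the conversion is probably not what steers the optimizer away from the index.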

Answer 2:

Indexes get used as well as the engine can manage. I can't guarantee it, but it SOUNDS like the engine is building a temporary index based on A and B to qualify the static values in your query. With 420+ million records, the time is spent just building that temporary index. By forcing an index, you help it avoid that time.

Now, if you (and others) don't quite understand indexes, they are a way of pre-grouping data to help the optimizer. When you have GROUP BY conditions, those components should, where practical, be part of the index, and TYPICALLY they would also be part of the criteria, as they are in your query.

select colA ,ColB , count(*) as count 
from user_table 
where colC >='2019-09-01 00:00:00' 
      and colC<'2019-09-30 23:59:59' 
      and colA in ("some static value") 
      and ColB in (17) 
group by colA ,ColB;

Now, let's look at your index, the only one available, based on ColC. Assume for the purposes of this scenario that records are grouped by day. Pretend each INDEX (single or compound) is stored in its own room. You have an index on just the date column C. In the room, you have 30 boxes (representing Sept 1 to Sept 30), not counting all the other boxes for other days. Now, you have to go through the box for each day and look for all entries that have the values of ColA and ColB that you want. The stuff in the box is not sorted, so you have to look at every record. Now, do this for all 30 days of September.

Now, picture the NEXT index, with its boxes stored in another room. This room holds a compound index based on (and in this order, to help optimize your query) columns A, B and C. So now, you could have 100 boxes for "A". You only care about ColA = "some static value", so you grab that one box.

Now, you open that box and see a bunch of smaller boxes... Oh, these are all the individual "Column B" values. The label on top of each box shows its individual "B" value, so you find the one box with the value 17.

Finally, you open that "B" box and look inside. Wow... the entries are all nicely sorted for you by date. So now, you scroll quickly to Sept 1 and pull all the entries you are looking for, up to Sept 30.

Getting to the source quickly through an optimized index will help you in the long run. Having an index on

(colA, colB, colC)

will significantly help your query performance.

One final note: since you are only querying for a single "A" value and a single "B" value, you would only get a single row back and would not need a group by clause (in this case).
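
As a sketch of that simplification, with the single values written as plain equality tests:

```sql
-- colA and ColB are fixed, so there is only one group to count.
SELECT COUNT(*) AS count
FROM user_table
WHERE colA = 'some static value'
  AND ColB = 17
  AND colC >= '2019-09-01 00:00:00'
  AND colC <  '2019-09-30 23:59:59';
```

One difference to keep in mind: when nothing matches, this form still returns one row with a count of 0, whereas the GROUP BY form returns no rows at all.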

Hope this helps you and others better understand how indexes work, from just individual to compound (multi-column).

One additional advantage of a multi-column index: in a case such as this, where all the columns in the query are part of the index, the database does not have to go to the raw data pages to check the other columns. You are looking only at the values of A, B and C, and all of these fields are part of the index, so it does not have to go back to the raw data pages, where the actual data is stored, to confirm that a record qualifies to be returned.

With a single-column index such as yours, it uses the index to find which records qualify (by date in this case). Then, for each record, it has to go to the raw data page holding the entire record (which could have 50 columns) just to check whether the A and B columns qualify, and discard the record if they do not. Then it goes back to the date index, then back to the raw data page to check A and B... You can probably see how much more time it takes to keep going back and forth.

The second index already has "A", "B" and the pre-sorted date range of "C". Done without having to go to the raw data pages.
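
Whether the query is actually answered from the index alone can be checked with EXPLAIN. The sketch below assumes a compound index on (colA, colB, colC) has already been created; the name idx_cola_colb_colc in the comments is hypothetical.

```sql
-- With a compound index on (colA, colB, colC) in place, the plan should name it
-- in the key column (e.g. idx_cola_colb_colc, a hypothetical name).
-- "Using index" in the Extra column means the rows are read from the index only,
-- without visiting the data file.
EXPLAIN
SELECT colA, ColB, COUNT(*) AS count
FROM user_table
WHERE colC >= '2019-09-01 00:00:00'
  AND colC <  '2019-09-30 23:59:59'
  AND colA IN ("some static value")
  AND ColB IN (17)
GROUP BY colA, ColB;
```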

Source: https://stackoverflow.com/questions/58551537/why-index-is-used-only-when-forced-but-not-by-default

Tags

mysql

indexing

MyISAM