MySQL: Optimal index for between queries

早过忘川 提交于 2020-01-01 08:43:26

问题


I have a table with the following structure:

CREATE TABLE `geo_ip` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `start_ip` int(10) unsigned NOT NULL,
  `end_ip` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `start_ip` (`start_ip`),
  KEY `end_ip` (`end_ip`),
  KEY `start_end` (`start_ip`,`end_ip`),
  KEY `end_start` (`end_ip`,`start_ip`)) ENGINE=InnoDB;

MySQL seems to be unable to use the indexes for most of my queries, as the where clause uses a between that falls somewhere between start_ip and end_ip:

select * from geo_ip where 2393196360 between start_ip and end_ip;

+----+-------------+--------+------+-------------------------------------+------+---------+------+---------+-------------+
| id | select_type | table  | type | possible_keys                       | key  | key_len | ref  | rows    | Extra       |
+----+-------------+--------+------+-------------------------------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | geo_ip | ALL  | start_ip,end_ip,start_end,end_start | NULL | NULL    | NULL | 2291578 | Using where |
+----+-------------+--------+------+-------------------------------------+------+---------+------+---------+-------------+

The table has a few million records. I tried expanding the table, by removing the start_ip and end_ip columns, and creating a row for every possible value of start_ip and end_ip as the id, then querying the id. While that vastly improved query performance, it resulted in the table size growing from less than a gigabyte to tens of gigabytes (the table has other columns obviously).

What else can be done to improve query performance? Can I change the query somehow, or can I index the columns differently to result in a hit? Or perhaps something I haven't thought of yet?

Edit:

Strangely, the index is used for certain values. For example:

explain select * from geo_ip where 3673747503 between start_ip and end_ip;
+----+-------------+--------+-------+-------------------------------------+--------+---------+------+-------+-------------+
| id | select_type | table  | type  | possible_keys                       | key    | key_len | ref  | rows  | Extra       |
+----+-------------+--------+-------+-------------------------------------+--------+---------+------+-------+-------------+
|  1 | SIMPLE      | geo_ip | range | start_ip,end_ip,start_end,end_start | end_ip | 4       | NULL | 19134 | Using where |
+----+-------------+--------+-------+-------------------------------------+--------+---------+------+-------+-------------+

回答1:


Not sure why, but adding an order by clause and limit to the query seems to always result in an index hit, and executes in a few milliseconds instead of a few seconds.

explain select * from geo_ip where 2393196360 between start_ip and end_ip order by start_ip desc limit 1;
+----+-------------+--------+-------+-----------------+----------+---------+------+--------+-------------+
| id | select_type | table  | type  | possible_keys   | key      | key_len | ref  | rows   | Extra       |
+----+-------------+--------+-------+-----------------+----------+---------+------+--------+-------------+
|  1 | SIMPLE      | geo_ip | range | start_ip,end_ip | start_ip | 4       | NULL | 975222 | Using where |
+----+-------------+--------+-------+-----------------+----------+---------+------+--------+-------------+

This is good enough for me now, although I would love to know the reasoning behind why the optimizer decides not to use the index in the other case.




回答2:


I've just run into the same problem. Since nobody answered the "WHY", and I figured it out, I'll write here an explanation for all future readers.

First, let's dissect the query.

where 2393196360 between start_ip and end_ip

really means

where start_ip <= C and end_ip >= C

so the engine will first use the index on start_ip, end_ip to fetch all rows for which start_ip is smaller than C, and then further filter out the rows for which end_ip is also bigger than C.

When the engine looks for start_ip <= C, and C is a value big enough such that most, or all start_ips are smaller than C, this "first pass" will result in a lot of rows. It will happen every time C is an IP on the higher end of the IP range.

Now, here's the main thing to realise: our dataset is made in such a way that for each start_ip, there is only an end_ip value, and this end_ip value is guaranteed to be lower than the next record's start_ip value. We are partitioning a range and the partitions do not overlap. But, in the general case, when it comes to two table fields, this does not have to be the case!

So, after the 'first pass', the engine will have to look through ALL records that match start_ip <= C to make sure that they also match end_ip >= C, despite the index. Having end_ip as part of the compound index does not do much in our case; it would help only if we had multiple values for end_ip for each value start_ip, but we only have 1. To give you an example, pretend that the columns were populated with the following data:

start_ip  end_ip
1         10001
1         10002
1         10003
------------
2         10001
2         10002
2         10003
------------
...
------------
9999      10001
9999      10002
9999      10003

if you ran a query with start_ip <= 10000 AND end_ip >= 10000, notice that ALL rows match the expression. On the other hand, in our case, with our ip-ranges dataset, we have the guarantee that only ONE record will match any start_ip <= C AND end_ip >= C expression, thanks to the way the ip data is structured. Specifically the record with the biggest value for start_ip, among all those that match start_ip <= C. That's why adding ORDER BY and LIMIT 1 works in this case, and is the cleanest solution, in my opinion.


Edit: I've just noticed that adding the ORDER BY start_ip DESC and LIMIT clauses may not be enough in some cases. If you run the query with a value that is not covered by any ranges in your data, for instance with private IPs like 127.0.0.1 or 192.168.*, the engine will still look at all records that match the start_ip <= C expression, and the query will be slow. That's because since no records matches the the second part of the expression (end_ip >= C), the LIMIT 1 clause never kicks in.

The solution I've found is to construct the query with a join so as to force the engine to first grab the record with the biggest value for start_ip where start_ip <= C, and only then check if end_ip is also >= C. Like this:

SELECT * 
FROM 
  ( select id FROM geo_ip WHERE start_ip <= C ORDER BY start_ip DESC LIMIT 1 ) limit_ip
  INNER JOIN geo_ip ON limit_ip.id = geo_ip.id
WHERE geo_ip.end_ip >= C

This query will perform a single lookup, whether or not the specific ip C is covered by the ranges in the table, and it only requires a single index on start_ip (as well as id as the primary key).




回答3:


Best index for BETWEEN queries are B-TREE indices. See MySQL docs on that Topic.

ALTER TABLE myTable ADD INDEX myIdx UNSING BTREE (myCol)



回答4:


If you create an index for start_ip and one for end_ip, I found I could get comparable results to Jeshurun's results without doing the order by, using an inner join with the same table:

select a.* from geo_ip a inner join geo_ip b on a.id=b.id where 2393196360 >= a.start_ip and 2393196360 <= b.end_ip limit 1;

Also you will find MySQL uses a partial index instead of reporting a full-index scan which is more comforting to me.




回答5:


Adding indices will help.

Note: If your query is sth like

where x between a and b AND y between c and d

, a INDEX(x, y) will not improve the performance, but two seperate indices for x and y will.



来源:https://stackoverflow.com/questions/22125474/mysql-optimal-index-for-between-queries

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!