MySQL: Optimal index for between queries

南笙 2021-02-19 13:26

I have a table with the following structure:

CREATE TABLE `geo_ip` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `start_ip` int(10) unsigned NOT NULL,
  `end_ip         


        
5 Answers
  • 2021-02-19 14:06

    Not sure why, but adding an ORDER BY clause and a LIMIT to the query seems to always result in an index hit, and the query executes in a few milliseconds instead of a few seconds.

    explain select * from geo_ip where 2393196360 between start_ip and end_ip order by start_ip desc limit 1;
    +----+-------------+--------+-------+-----------------+----------+---------+------+--------+-------------+
    | id | select_type | table  | type  | possible_keys   | key      | key_len | ref  | rows   | Extra       |
    +----+-------------+--------+-------+-----------------+----------+---------+------+--------+-------------+
    |  1 | SIMPLE      | geo_ip | range | start_ip,end_ip | start_ip | 4       | NULL | 975222 | Using where |
    +----+-------------+--------+-------+-----------------+----------+---------+------+--------+-------------+
    

    This is good enough for me now, although I would love to know the reasoning behind why the optimizer decides not to use the index in the other case.
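
    For readers reproducing this: the literal 2393196360 in the query is an IPv4 address stored as an unsigned 32-bit integer, which is what the `start_ip` and `end_ip` columns hold. A minimal sketch of the conversion using Python's standard `ipaddress` module follows; the helper names `ip_to_int` and `int_to_ip` are made up for illustration (inside MySQL itself, `INET_ATON()` and `INET_NTOA()` do the same job).

```python
import ipaddress


def ip_to_int(dotted: str) -> int:
    """Convert a dotted-quad IPv4 string to its 32-bit integer form."""
    return int(ipaddress.IPv4Address(dotted))


def int_to_ip(value: int) -> str:
    """Convert a 32-bit integer back to a dotted-quad IPv4 string."""
    return str(ipaddress.IPv4Address(value))


if __name__ == "__main__":
    # The value used in the query above.
    print(int_to_ip(2393196360))
    print(ip_to_int("127.0.0.1"))
```

    With these helpers, `int_to_ip(2393196360)` gives `142.165.71.72`.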

  • 2021-02-19 14:07

    The best indexes for BETWEEN queries are B-tree indexes. See the MySQL docs on that topic.

    ALTER TABLE myTable ADD INDEX myIdx USING BTREE (myCol)
    
  • 2021-02-19 14:15

    With one index on start_ip and another on end_ip, I found I could get results comparable to Jeshurun's without the ORDER BY, by using an inner join of the table with itself:

    select a.* from geo_ip a inner join geo_ip b on a.id=b.id where 2393196360 >= a.start_ip and 2393196360 <= b.end_ip limit 1;
    

    You will also find that EXPLAIN reports a partial index scan rather than a full index scan, which I find more reassuring.

  • 2021-02-19 14:18

    Adding indices will help.

    Note: If your query is something like

    where x between a and b AND y between c and d
    

    , an INDEX(x, y) will not improve the performance, but two separate indices for x and y will.

  • 2021-02-19 14:28

    I've just run into the same problem. Since nobody answered the "WHY", and I figured it out, I'll write here an explanation for all future readers.

    First, let's dissect the query.

    where 2393196360 between start_ip and end_ip
    

    really means

    where start_ip <= C and end_ip >= C
    

    so the engine will first use the index on (start_ip, end_ip) to fetch all rows for which start_ip is less than or equal to C, and then keep only those rows for which end_ip is also greater than or equal to C.

    When the engine looks for start_ip <= C, and C is a value big enough such that most, or all start_ips are smaller than C, this "first pass" will result in a lot of rows. It will happen every time C is an IP on the higher end of the IP range.
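
    To make the size of that "first pass" concrete, here is a small sketch that counts rows on a toy dataset. It uses Python's built-in `sqlite3` rather than MySQL, purely so it runs anywhere, so it says nothing about MySQL's actual query plan. The 1000 ranges of width 10 and the value of `C` are invented for illustration.

```python
import sqlite3

# Hypothetical toy data: 1000 non-overlapping ranges, each 10 addresses wide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE geo_ip ("
    " id INTEGER PRIMARY KEY,"
    " start_ip INTEGER NOT NULL,"
    " end_ip INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO geo_ip (start_ip, end_ip) VALUES (?, ?)",
    [(i * 10, i * 10 + 9) for i in range(1000)],
)

C = 9995  # a value near the top of the covered range


def count(where: str) -> int:
    """Count the rows of geo_ip that satisfy the given WHERE expression."""
    sql = f"SELECT COUNT(*) FROM geo_ip WHERE {where}"
    return conn.execute(sql, {"c": C}).fetchone()[0]


first_pass = count("start_ip <= :c")                  # rows an index on start_ip narrows to
between = count(":c BETWEEN start_ip AND end_ip")     # the original predicate
both = count("start_ip <= :c AND end_ip >= :c")       # its rewritten form

if __name__ == "__main__":
    print(first_pass, between, both)
```

    On this data every row passes `start_ip <= C`, yet only one row satisfies the full predicate, and the BETWEEN form and the two-comparison form agree.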

    Now, here's the main thing to realise: our dataset is built so that each start_ip has exactly one end_ip value, and that end_ip is guaranteed to be lower than the next record's start_ip. We are partitioning a range, and the partitions do not overlap. In the general case of two arbitrary table columns, however, this does not have to hold!

    So, after the 'first pass', the engine will have to look through ALL records that match start_ip <= C to make sure that they also match end_ip >= C, despite the index. Having end_ip as part of the compound index does not do much in our case; it would help only if we had multiple values for end_ip for each value start_ip, but we only have 1. To give you an example, pretend that the columns were populated with the following data:

    start_ip  end_ip
    1         10001
    1         10002
    1         10003
    ------------
    2         10001
    2         10002
    2         10003
    ------------
    ...
    ------------
    9999      10001
    9999      10002
    9999      10003
    

    If you ran a query with start_ip <= 10000 AND end_ip >= 10000, notice that ALL rows match the expression. In our case, on the other hand, the ip-ranges dataset guarantees that at most ONE record will match any start_ip <= C AND end_ip >= C expression, thanks to the way the IP data is structured: specifically, the record with the biggest value for start_ip among all those that match start_ip <= C. That's why adding ORDER BY and LIMIT 1 works in this case, and it is the cleanest solution, in my opinion.
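
    The same reasoning can be sketched outside the database. Assuming the ranges are non-overlapping and sorted by start, the only candidate for a value is the range with the largest start not exceeding it, which a binary search finds directly. The `ranges` list and the `lookup` function below are hypothetical, not part of the original schema.

```python
from bisect import bisect_right

# Hypothetical non-overlapping ranges, sorted by start (note the gap 20..29).
ranges = [(0, 9), (10, 19), (30, 39)]
starts = [start for start, _ in ranges]


def lookup(c):
    """Return the (start, end) range containing c, or None if c is not covered."""
    # Index of the range with the largest start <= c (the "ORDER BY ... DESC LIMIT 1" row).
    i = bisect_right(starts, c) - 1
    if i < 0:
        return None  # c is below every start
    start, end = ranges[i]
    # Only this one candidate needs its end checked.
    return (start, end) if end >= c else None


if __name__ == "__main__":
    print(lookup(15))
    print(lookup(25))
```

    Only one candidate row is ever examined, which is the behaviour that ORDER BY start_ip DESC with LIMIT 1 is meant to obtain from the index.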


    Edit: I've just noticed that adding the ORDER BY start_ip DESC and LIMIT clauses may not be enough in some cases. If you run the query with a value that is not covered by any range in your data, for instance private IPs like 127.0.0.1 or 192.168.*, the engine will still look at all records that match the start_ip <= C expression, and the query will be slow. Because no record matches the second part of the expression (end_ip >= C), the LIMIT 1 clause never kicks in.

    The solution I've found is to construct the query with a join so as to force the engine to first grab the record with the biggest value for start_ip where start_ip <= C, and only then check if end_ip is also >= C. Like this:

    SELECT * 
    FROM 
      ( select id FROM geo_ip WHERE start_ip <= C ORDER BY start_ip DESC LIMIT 1 ) limit_ip
      INNER JOIN geo_ip ON limit_ip.id = geo_ip.id
    WHERE geo_ip.end_ip >= C
    

    This query will perform a single lookup, whether or not the specific ip C is covered by the ranges in the table, and it only requires a single index on start_ip (as well as id as the primary key).
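
    For anyone who wants to check the query's behaviour quickly, here is a sketch that runs the same statement against an in-memory SQLite database through Python's `sqlite3` module. SQLite is a stand-in here, so this checks the result of the query, not MySQL's execution plan or timing. The three sample rows, the index name `idx_start_ip`, and the helper `find_range` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE geo_ip ("
    " id INTEGER PRIMARY KEY,"
    " start_ip INTEGER NOT NULL,"
    " end_ip INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_start_ip ON geo_ip (start_ip)")
# Hypothetical sample rows; values 20..29 are deliberately not covered.
conn.executemany(
    "INSERT INTO geo_ip (id, start_ip, end_ip) VALUES (?, ?, ?)",
    [(1, 0, 9), (2, 10, 19), (3, 30, 39)],
)

# Same shape as the query above, with C passed as the named parameter :c.
QUERY = """
SELECT geo_ip.id, geo_ip.start_ip, geo_ip.end_ip
FROM (
    SELECT id FROM geo_ip
    WHERE start_ip <= :c
    ORDER BY start_ip DESC
    LIMIT 1
) AS limit_ip
INNER JOIN geo_ip ON limit_ip.id = geo_ip.id
WHERE geo_ip.end_ip >= :c
"""


def find_range(c):
    """Return the (id, start_ip, end_ip) row covering c, or None."""
    return conn.execute(QUERY, {"c": c}).fetchone()


if __name__ == "__main__":
    print(find_range(15))  # covered value
    print(find_range(25))  # value that falls in the gap
```

    A covered value returns its single row, and a value in the gap returns nothing, because the inner query has already narrowed the candidates to one row before end_ip is checked.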
