SQL Performance: Using OR is slower than IN when using order by

Submitted by 安稳与你 on 2019-12-06 11:58:29

Without the ORDER BY, the query runs much faster because it can stop as soon as it has found 1000 rows. How many rows match that OR/IN?

Notice that the EXPLAINs say that the query is Using index. That means that you have a "covering" index; that is, all the columns in the SELECT are in the one index.

In InnoDB, each secondary key implicitly includes the PK, so INDEX(zip_code, timestamp_updated) is effectively INDEX(zip_code, timestamp_updated, primaryKey).

The index is not very efficient since you have two non-trivial things going on: (1) IN or OR, (2) ORDER BY. Only one or the other can be handled by the index. Your index lets it use zip_code. With that index, the query:

  1. finds the rows in the index that match any of those zip codes,
  2. gathers the timestamp and PK, putting the 3 columns in a tmp table,
  3. sorts,
  4. delivers the first 1000.

If, instead, you said INDEX(timestamp_updated, zip_code) you would still have a 'covering' index, but in this flavor, the index would (I hope) prevent the need for the SORT. Oh, given that, it might be able to stop after 1000 rows. Here's how it will work:

  1. Scan through the index in timestamp order.
  2. Check each row for being one of those zips. (Here the test might be faster in IN format)
  3. If match, deliver row; if 1000, stop.
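As a rough sketch of that alternative, the index could be added as below. The table and column names are the ones used elsewhere in this answer; the index name idx_updated_zip is made up for illustration, and the query is only a guess at the shape of the original, with the zip list cut down to three values.

```sql
-- Hypothetical index name; timestamp first so the index order matches the ORDER BY.
ALTER TABLE texas_parcels
    ADD INDEX idx_updated_zip (timestamp_updated, zip_code);

-- Assumed query shape (the real zip list is much longer).
SELECT  primary_key
    FROM  texas_parcels
    WHERE  zip_code IN ('28461', '48227', '60411')
    ORDER BY  timestamp_updated
    LIMIT  1000;
```

Whether the optimizer actually walks this index in timestamp order and stops at 1000 rows is something to confirm with EXPLAIN, not something to take for granted.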

But wait... Now you are at the mercy of the 12M rows. If 1000 rows with those zips occur early (old timestamps), it can stop fast. If you need to check all the rows to find 1000 (or there aren't even 1000), then it is a full index scan, and this arrangement of the index is 'bad'.

If you give the optimizer both INDEXes, it will dutifully make an intelligent choice based on inadequate information (no distribution of the values), and might pick the worse one.
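If both indexes exist, one way to see which is really faster for your data is to time the query once with each index forced. This is only a sketch: idx_zip_updated and idx_updated_zip are hypothetical names for INDEX(zip_code, timestamp_updated) and INDEX(timestamp_updated, zip_code), and the zip list is shortened.

```sql
-- Force the zip-first index (hypothetical name): filter by zip, then sort.
SELECT  primary_key
    FROM  texas_parcels FORCE INDEX (idx_zip_updated)
    WHERE  zip_code IN ('28461', '48227', '60411')
    ORDER BY  timestamp_updated
    LIMIT  1000;

-- Force the timestamp-first index (hypothetical name): scan in order, filter by zip.
SELECT  primary_key
    FROM  texas_parcels FORCE INDEX (idx_updated_zip)
    WHERE  zip_code IN ('28461', '48227', '60411')
    ORDER BY  timestamp_updated
    LIMIT  1000;
```

Run each one twice so that caching does not skew the comparison.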

You effectively need a 2-dimensional index. Such don't exist. (Well, maybe Spatial could be kludged?) But...

PARTITION BY RANGE(timestamp) together with the INDEX starting with zip might work better. But I doubt if the optimizer is smart enough to realize that if it found 1000 rows in the first partition it could quit. And it still fails badly if there aren't 1000 results.

PARTITION BY RANGE(zip) together with the INDEX starting with timestamp probably will not help, since that many zips won't do much pruning.

Please provide EXPLAIN FORMAT=JSON SELECT...; for each of your attempts. There may be some subtle clues there to explain the wide time variations.
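For example, assuming the query looks roughly like the one below (the zip list is shortened and the exact SELECT is a guess, since the original is not shown here), the request would be:

```sql
-- Assumed query shape; substitute your real SELECT after EXPLAIN FORMAT=JSON.
EXPLAIN FORMAT=JSON
SELECT  primary_key
    FROM  texas_parcels
    WHERE  zip_code IN ('28461', '48227', '60411')
    ORDER BY  timestamp_updated
    LIMIT  1000;
```

In the JSON output, fields such as using_filesort and using_index are the kind of detail worth comparing between the IN and OR versions.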

Did you run each timing twice? (Otherwise, caching may have colored the results.)

Another approach

I do not know how well this will perform, but here goes:

SELECT  primary_key
    FROM  (
            ( SELECT  primary_key, timestamp_updated
                  FROM  texas_parcels
                  WHERE  zip_code = '28461'
                  ORDER BY  timestamp_updated
                  LIMIT  1000
            )
            UNION  ALL
            ( SELECT  primary_key, timestamp_updated
                  FROM  texas_parcels
                  WHERE  zip_code = '48227'
                  ORDER BY  timestamp_updated
                  LIMIT  1000
            )
            UNION  ALL
            ( SELECT  primary_key, timestamp_updated
                  FROM  texas_parcels
                  WHERE  zip_code = '60411'
                  ORDER BY  timestamp_updated
                  LIMIT  1000
            ) ...
          ) x
    ORDER BY  timestamp_updated
    LIMIT  1000

It seems like x will have only a few hundred thousand rows, not 1.3M. But UNION has some overhead, etc. Note the LIMIT in each subquery and on the outside. If you need OFFSET, too, it gets trickier.

You have a pretty long list of zip codes that you are comparing. MySQL has an optimization that may explain why the execution times without the ORDER BY are a bit different: when IN is given a list of constants, MySQL sorts the list and does a binary search. I could see this explaining the last results.
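To make the two forms concrete, here is a sketch using three of the zip codes; the real list is much longer, and the SELECT is an assumed shape, not the original query.

```sql
-- IN form: the constant list can be sorted once and probed with a binary search.
SELECT  primary_key
    FROM  texas_parcels
    WHERE  zip_code IN ('28461', '48227', '60411')
    LIMIT  1000;

-- OR form: logically equivalent, written as separate equality tests.
SELECT  primary_key
    FROM  texas_parcels
    WHERE  zip_code = '28461'
       OR  zip_code = '48227'
       OR  zip_code = '60411'
    LIMIT  1000;
```

How much the difference matters likely depends on the length of the list and on whether the optimizer turns both forms into the same index ranges, so treat this as something to measure.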

With the order by, I am not sure. The actual execution might be affected by other things running on the server. Do you know if anything else is running?

MySQL has an optimization for IN; when your SELECT uses OR instead, the number of comparisons increases.
