SQL Performance: Using OR is slower than IN when using order by

生来就可爱ヽ(ⅴ<●) 提交于 2019-12-07 22:46:36

问题


I am using MariaDB 10.0.21 and running a query similar to the following query against 12 Million Rows:

SELECT 
    `primary_key` 
FROM 
    `texas_parcels` 
WHERE 
    `zip_code`
         IN ('28461', '48227', '60411', '65802', '75215', '75440', '75773', '75783', '76501', '76502', '76504', '76511', '76513', '76518', '76519', '76520', '76522', '76525', '76527', '76528', '76530', '76537', '76539', '76541', '76542', '76543', '76548', '76549', '76550', '76556', '76567', '76571', '76574', '76577', '76578', '76642', '76704', '76853', '77418', '77434', '77474', '77833', '77835', '77836', '77845', '77853', '77879', '77964', '77975', '78002', '78003', '78006', '78013', '78028', '78056', '78064', '78070', '78114', '78123', '78130', '78132', '78133', '78154', '78155', '78359', '78382', '78602', '78605', '78606', '78607', '78608', '78609', '78610', '78611', '78612', '78613', '78614', '78615', '78616', '78617', '78619', '78620', '78621', '78623', '78624', '78626', '78628', '78629', '78632', '78633', '78634', '78636', '78638', '78639', '78640', '78641', '78642', '78643', '78644', '78645', '78648', '78650', '78652', '78653', '78654', '78655', '78656', '78657', '78659', '78660', '78662', '78663', '78664', '78665', '78666', '78669', '78672', '78676', '78681', '78699', '78701', '78702', '78703', '78704', '78705', '78717', '78719', '78721', '78722', '78723', '78724', '78725', '78726', '78727', '78728', '78729', '78730', '78731', '78732', '78733', '78734', '78735', '78736', '78737', '78738', '78739', '78741', '78744', '78745', '78746', '78747', '78748', '78749', '78750', '78751', '78752', '78753', '78754', '78756', '78757', '78758', '78759', '78828', '78934', '78940', '78941', '78942', '78945', '78946', '78947', '78948', '78953', '78954', '78956', '78957', '78963', '92536') 
ORDER BY 
    `timestamp_updated` ASC
LIMIT 1000;

I have a compound index on (zip_code,timestamp_updated), I obtain results in ~1.6 Seconds. In the next query I am still seraching the same zipcodes but I am using OR instead of IN().

SELECT 
    `primary_key`
FROM 
    `texas_parcels` 
WHERE
(`zip_code` = '28461' OR `zip_code` = '48227' OR `zip_code` = '60411' OR `zip_code` = '65802' OR `zip_code` = '75215' OR `zip_code` = '75440' OR `zip_code` = '75773' OR `zip_code` = '75783' OR `zip_code` = '76501' OR `zip_code` = '76502' OR `zip_code` = '76504' OR `zip_code` = '76511' OR `zip_code` = '78957' OR `zip_code` = '78963' OR `zip_code` = '92536' OR `zip_code` = '76513' OR `zip_code` = '76518' OR `zip_code` = '76519' OR `zip_code` = '76520' OR `zip_code` = '76522' OR `zip_code` = '76525' OR `zip_code` = '76527' OR `zip_code` = '76528' OR `zip_code` = '76530' OR `zip_code` = '76537' OR `zip_code` = '76539' OR `zip_code` = '76541' OR `zip_code` = '76542' OR `zip_code` = '76543' OR `zip_code` = '76548' OR `zip_code` = '76549' OR `zip_code` = '76550' OR `zip_code` = '76556' OR `zip_code` = '76567' OR `zip_code` = '76571' OR `zip_code` = '76574' OR `zip_code` = '76577' OR `zip_code` = '76578' OR `zip_code` = '76642' OR `zip_code` = '76704' OR `zip_code` = '76853' OR `zip_code` = '77418' OR `zip_code` = '77434' OR `zip_code` = '77474' OR `zip_code` = '77833' OR `zip_code` = '77835' OR `zip_code` = '77836' OR `zip_code` = '77845' OR `zip_code` = '77853' OR `zip_code` = '77879' OR `zip_code` = '77964' OR `zip_code` = '77975' OR `zip_code` = '78002' OR `zip_code` = '78003' OR `zip_code` = '78006' OR `zip_code` = '78013' OR `zip_code` = '78028' OR `zip_code` = '78056' OR `zip_code` = '78064' OR `zip_code` = '78070' OR `zip_code` = '78114' OR `zip_code` = '78123' OR `zip_code` = '78130' OR `zip_code` = '78132' OR `zip_code` = '78133' OR `zip_code` = '78154' OR `zip_code` = '78155' OR `zip_code` = '78359' OR `zip_code` = '78382' OR `zip_code` = '78602' OR `zip_code` = '78605' OR `zip_code` = '78606' OR `zip_code` = '78607' OR `zip_code` = '78608' OR `zip_code` = '78609' OR `zip_code` = '78610' OR `zip_code` = '78611' OR `zip_code` = '78612' OR `zip_code` = '78613' OR `zip_code` = '78614' OR `zip_code` = '78615' OR `zip_code` = '78616' OR `zip_code` = '78617' OR `zip_code` = '78619' OR `zip_code` = '78620' OR `zip_code` = '78621' OR `zip_code` = '78623' OR `zip_code` = '78624' OR `zip_code` = '78626' OR `zip_code` = '78628' OR `zip_code` = '78629' OR `zip_code` = '78632' OR `zip_code` = '78633' OR `zip_code` = '78634' OR `zip_code` = '78636' OR `zip_code` = '78638' OR `zip_code` = '78639' OR `zip_code` = '78640' OR `zip_code` = '78641' OR `zip_code` = '78642' OR `zip_code` = '78643' OR `zip_code` = '78644' OR `zip_code` = '78645' OR `zip_code` = '78648' OR `zip_code` = '78650' OR `zip_code` = '78652' OR `zip_code` = '78653' OR `zip_code` = '78654' OR `zip_code` = '78655' OR `zip_code` = '78656' OR `zip_code` = '78657' OR `zip_code` = '78659' OR `zip_code` = '78660' OR `zip_code` = '78662' OR `zip_code` = '78663' OR `zip_code` = '78664' OR `zip_code` = '78665' OR `zip_code` = '78666' OR `zip_code` = '78669' OR `zip_code` = '78672' OR `zip_code` = '78676' OR `zip_code` = '78681' OR `zip_code` = '78699' OR `zip_code` = '78701' OR `zip_code` = '78702' OR `zip_code` = '78703' OR `zip_code` = '78704' OR `zip_code` = '78705' OR `zip_code` = '78717' OR `zip_code` = '78719' OR `zip_code` = '78721' OR `zip_code` = '78722' OR `zip_code` = '78723' OR `zip_code` = '78724' OR `zip_code` = '78725' OR `zip_code` = '78726' OR `zip_code` = '78727' OR `zip_code` = '78728' OR `zip_code` = '78729' OR `zip_code` = '78730' OR `zip_code` = '78731' OR `zip_code` = '78732' OR `zip_code` = '78733' OR `zip_code` = '78734' OR `zip_code` = '78735' OR `zip_code` = '78736' OR `zip_code` = '78737' OR `zip_code` = '78738' OR `zip_code` = '78739' OR `zip_code` = '78741' OR `zip_code` = '78744' OR `zip_code` = '78745' OR `zip_code` = '78746' OR `zip_code` = '78747' OR `zip_code` = '78748' OR `zip_code` = '78757' OR `zip_code` = '78758' OR `zip_code` = '78759' OR `zip_code` = '78828' OR `zip_code` = '78934' OR `zip_code` = '78940' OR `zip_code` = '78941' OR `zip_code` = '78942' OR `zip_code` = '78945' OR `zip_code` = '78946' OR `zip_code` = '78947' OR `zip_code` = '78948' OR `zip_code` = '78953' OR `zip_code` = '78954' OR `zip_code` = '78956')
ORDER BY 
    `timestamp_updated` ASC
LIMIT 1000;

This second query obtains the same results in the same order in ~7.8 Seconds. When running each query through an explain, they are nearly the same, they give me a slightly different rows amount.

id  select_type     table        type      possible_keys          key            key_len    ref        rows       filtered              Extra
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1     SIMPLE     TX_Property   range    Zip Code Search     Zip Code Search       15     (NULL)      2402699     99.88      Using where; Using index; Using filesort  
 2     SIMPLE     TX_Property   range    Zip Code Search     Zip Code Search       15     (NULL)      2321908     99.91      Using where; Using index; Using filesort 

When Profiling the two queries the only major increase in time is the Sorting Result time, in the second query it took up to 7.2 Seconds.

I guess what I don't understand is how the different operator in the where could make such a huge difference when it comes to order, it would make sense if there was a major difference in time for executing? Maybe I am just not exactly sure how profiling works, and it actually is just the time of executing the where portion, but it is just labeled in a convoluted way?

I also wanted to note that when I ran the queries without the ORDER BY timestamp_updated ASC the top query took ~0.106 Seconds and the second query took ~0.157 Seconds.


回答1:


Removing the ORDER BY runs so much faster because it can stop after only 1000 rows. How many rows match that OR/IN?

Notice that the EXPLAINs say that the query is Using index. That means that you have a "covering" index. That is all the fields in the SELECT are in the one index.

In InnoDB, each secondary key implicitly includes the PK, so INDEX(zip_code, timestamp_updated) is effectively INDEX(zip_code, timestamp_updated, primaryKey)

The index is not very efficient since you have two non-trivial things going on: (1) IN or OR, (2) ORDER BY. Only one or the other can be handled by the index. Your index lets it use zip_code. It

  1. finds the rows in the index that match any of those zipcodes,
  2. gathers the timestamp and pk, putting the 3 columns in a tmp table
  3. sorts
  4. delivers the first 1000.

If, instead, you said INDEX(timestamp_updated, zip_code) you would still have a 'covering' index, but in this flavor, the index would (I hope) prevent the need for the SORT. Oh, given that, it might be able to stop after 1000 rows. Here's how it will work:

  1. Scan through the index in timestamp order.
  2. Check each row for being one of those zips. (Here the test might be faster in IN format)
  3. If match, deliver row; if 1000, stop.

But wait... Now you are at the mercy of the 12M rows. If 1000 rows with those zips occur early (old timestamps), it can stop fast. If you need to check all the rows to find 1000 (or there aren't even 1000), then it is a full index scan, and this arrangement of the index is 'bad'.

If you give the optimizer both INDEXes, it will dutifully make an intelligent choice based on inadequate information (no distribution of the values), and might pick the worse one.

You effectively need a 2-dimensional index. Such don't exist. (Well, maybe Spatial could be kludged?) But...

PARTITION BY RANGE(timestamp) together with the INDEX starting with zip might work better. But I doubt if the optimizer is smart enough to realize that if it found 1000 rows in the first partition it could quit. And it still fails badly if there aren't 1000 results.

PARTITION BY RANGE(zip) together with the INDEX starting with timestamp probably will not help, since that many zips won't do much pruning.

Please provide EXPLAIN FORMAT=JSON SELECT...; for each of your attempts. There may be some subtle clues there to explain the wide time variations.

Did you run each timing twice? (Otherwise, caching may have colored the results.)

Another approach

I do not know how well this will perform, but here goes:

SELECT  primary_key
    FROM  ( 
              ( SELECT  primary_key, timestamp_updated
                    FROM  texas_parcels
                    WHERE  zip_code = '28461'
                    ORDER BY  timestamp_updated
                    LIMIT  1000 
              )
            UNION  ALL (
                SELECT  primary_key, timestamp_updated
                    FROM  texas_parcels
                    WHERE  zip_code = '48227'
                    ORDER BY  timestamp_updated
                    LIMIT  1000 
                       )
            UNION  ALL (
                SELECT  primary_key, timestamp_updated
                    FROM  texas_parcels
                    WHERE  zip_code = '60411'
                    ORDER BY  timestamp_updated
                    LIMIT  1000 ) ... 
          ) x
    ORDER BY  timestamp_updated
    LIMIT  1000 

It seems like x will have only a few hundred thousand rows, not 1.3M. But UNION has some overhead, etc. Note the LIMIT in each subquery and on the outside. If you need OFFSET, too, it gets trickier.




回答2:


You have a pretty long list of zip codes that you are comparing. MySQL has an optimization that affects why the execution time without the order by is a bit different. With a list of constants, MySQL sorts the list and does a binary search. I could see this explaining the last results.

With the order by, I am not sure. The actual execution might be affected by other things running on the server. Do you know if anything else is running?




回答3:


The MYSQL has an optimization , your select when you use OR the number of comparisons increases.



来源:https://stackoverflow.com/questions/32256402/sql-performance-using-or-is-slower-than-in-when-using-order-by

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!