Performance of string comparison vs int join in SQL

旧街凉风 提交于 2019-11-28 13:28:32

Compared to the other operations being performed, it is unlikely that there is much performance difference between the two approaches. If you have only a handful of colors (up to a few hundred), the color table fits on a single page in most databases. An index on the color would make look up quite fast and not incur any I/O activity (after the first run to load the page).

A string comparison depends on the database, but it does involve a function and reading the data from the page. So, it is not free. Different databases, of course, might have different performance characteristics for a string function.

Where it should be stored should be a function of your application. Say you have an application where the color is going to be presented to the user. You might, one day, want to show the name of the color in Spanish, Swahili, or Chinese. If so, having a separate table makes such internationalization much easier. More prosaicly, you might want to prevent "Grene" from being entered, if so, having such a table makes a selection list easier.

On the other hand, if performance is your only concern, it doesn't make a different. In other cases, it is actually possible for a lookup table to be faster than a denormalized table. This occurs when the strings are long, increasing the length of every record in a larger table. Larger tables mean more pages, which take longer to load into memory.

If I understand correctly, you are asking which of these two queries would be faster:

SELECT * FROM shirt where color = 'Green'

vs

SELECT shirt.* FROM shirt s INNER JOIN colors c 
       ON s.colorid = c.colorid 
       WHERE c.color = 'Green'

It depends a little bit on the database (well ... maybe a lot depending on if it optimizes correctly, which most if not all should), but the lookup in the color table should be negligible and then the remaining execution could use the integer lookup value and should be faster. The bulk of the processing ultimately would be equivalent to SELECT * from shirt WHERE colorid=N. However, I suspect that you would not notice a difference in speed unless the table was quite large. The decision should probably be based on which design makes the most sense (probably the normalized one).

Beyond performance, creating a separate Color table makes your design better normalized. So, some day in the future, when someone decides that "Dark Blue" should now be called "Navy Blue", you update 1 row in your Color table vs. updating many rows in your Shirt table.

The DBMS has the opportunity to optimize indicies where there is a limited number of values. How to tell sQL to do this though, I don't know. It might figure it out.

start a data warehouse if reporting performance is a serious issue..

As Joe points out, you want the database as normalized as possible. If you have a separate reporting function, which has the possibility of causing performance problems, you should run a periodic transform (or put rules in place to do build in realtime) a 2nd read-only schema. The first is OLTP and the 2nd is OLAP ('data warehouse'); these are important concepts to have in place if you are going to get serious about your data.

If it's important enought to know, test it.

If nobody gives you an answer, the best way to do it is to test on your own.

(1) make 2 databases

(2) each with a test of your 2 tables

(3) On database just joins on the string 'color', and uses that for an FK; the other joins by int ('colorID')

Fill each with 2 million dummy rows. Run a multiple queries on each, timing 1st run and average runs.

Use an instance on your dev machine to take the network out of the picture.

You should also start and stop the instance before each type of test; stuff will stay in memory intentionally so SQL can deliver it faster, but likely, this will skew your test results from real-world operation - where it might not be in memory or cached anymore.

Carlos Grappa

It depends on the query optimizer really. Your color table will be very small, so probably based on the database statistics and the query plans it would probably loaded completely in memory, so you not only end up negating the performance cost of the join it may actually be faster. This obviously depends on the dbms you're using, but several dbms can take hints as to treat a table in a special way.

Another +1 for the Color table is that if you need to change the color name, you only need 1 update as opposed to changing the string value for each occurrence.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!