Is it faster to search for a large string in a DB by its hashcode?

前端 未结 10 2480
余生分开走
余生分开走 2020-12-17 19:32

If I need to retrieve a large string from a DB, Is it faster to search for it using the string itself or would I gain by hashing the string and storing the hash in the DB as

相关标签:
10条回答
  • 2020-12-17 20:19

    Are you doing an equality match, or a containment match? For an equality match, you should let the db handle this (but add a non-clustered index) and just test via WHERE table.Foo = @foo. For a containment match, you should perhaps look at full text index.

    0 讨论(0)
  • 2020-12-17 20:20

    The 'ideal' answer is definitely yes. String matching against an indexed column will always be slower than matching a hashvalue stored in an index column. This is what hashvalues are designed for, because they take a large dataset (e.g. 3000 comparison points, one per character) and coalesce it into a smaller dataset, (e.g. 16 comparison points, one per byte).

    So, the most optimized string comparison tool will be slower than the optimized hash value comparison.

    However, as has been noted, implementing your own optimized hash function is dangerous and likely to not go well. (I've tried and failed miserably) Hash collisions are not particulrly a problem, because then you will just have to fall back on the string matching algorithm, which means that would be (at worst) exactly as fast as your string comparison method.

    But, this is all assuming that your hashing is done in an optimal fashion, (which it probably won't be) and that there will not be any bugs in your hashing component (which there will be) and that the performance increase will be worth the effort (probably not). String comparison algorithms, especially in indexed columns are already pretty fast, and the hashing effort (programmer time) is likely to be much higher than your possible gain.

    And if you want to know about performance, Just Measure It.

    0 讨论(0)
  • 2020-12-17 20:21

    First - MEASURE it. That is the only way to tell for sure.
    Second - If you don't have an issue with the speed of the string searching, then keep it simple and don't use a Hash.

    However, for your actual question (and just because it is an interesting thought). It depends on how similar the strings are. Remember that the DB engine doesn't need to compare all the characters in a string, only enough to find a difference. If you are looking through 10 million strings that all start with the same 300 characters then the hash will almost certainly be faster. If however you are looking for the only string that starts with an x, then i the string comparison could be faster. I think though that SQL will still have to get the entire string from disc, even if it then only uses the first byte (or first few bytes for multi byte characters), so the total string length will still have an impact.

    If you are trying the hash comparison then you should make the hash an indexed calculated column. It will not be faster if you are working out the hashes for all the strings each time you run a query!

    You could also consider using SQL's CRC function. It produces an int, which will be even quicker to comapre and is faster to calculate. But you will have to double check the results of this query by actually testing the string values because the CRC function is not designed for this sort of usage and is much more likly to return duplicate values. You will need to do the CRC or Hash check in one query, then have an outer query that compares the strings. You will also want to watch the QEP generated to make sure the optimiser is processing the query in the order you intended. It might decide to do the string comparisons first, then the CRC or Hash checks second.

    As someone else has pointed out, this is only any good if you are doing an exact match. A hash can't help if you are trying to do any sort of range or partial match.

    0 讨论(0)
  • 2020-12-17 20:23

    If you use a fixed length field and an index it will probably be faster...

    0 讨论(0)
提交回复
热议问题