Question
I have a table with 5-10 million records, which has 2 fields.
Example data:
Row  Field1    Field2
---  --------  ----------
1    0712334   072342344
2    06344534  083453454
3    06344534  0845645565
Given 2 variables:
variable1: 0634453445645
variable2: 08345345456756
I need to be able to query the table for the best matches as fast as possible.
The above example would produce 1 record (row 2).
What would be the fastest way to query the database for matches?
Note: the data and variables are always in this format (i.e. always numeric, may or may not have a leading zero; the fields are not unique, but the combination of both is).
My initial thought was to do something like this:
select blah where @variable1 like Field1 + '%' and @variable2 like Field2 + '%'
Please forgive my pseudo code if it's not correct, as this is more a fact-finding mission; however, I think I'm in the ballpark.
Note: I don't think any indexing can help here, though I'm guessing a memory-based table would speed this up.
Can anyone think of a better way of solving the problem? Any suggestions or comments would be greatly appreciated. Thanks.
Answer 1:
You can get a plan with a seek on an index on Field1 with a query like this:
declare @V1 varchar(20) = '0634453445645'
declare @V2 varchar(20) = '08345345456756'

select Field1,
       Field2
from YourTable
where Field1 like left(@V1, 4) + '%'
  and @V1 like Field1 + '%'
  and @V2 like Field2 + '%'
It does a range seek on the first four characters of Field1 and uses the full comparisons on Field1 and Field2 in a residual predicate.
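To make the predicate logic concrete, here is a small Python sketch (the data is taken from the question; the function name is illustrative, not part of the answer). It reproduces the two-stage filter: the seek predicate narrows candidates to Field1 values sharing the variable's first four characters, and the residual predicates check the full prefix match on both fields.

```python
def best_matches(rows, v1, v2):
    """Find rows where v1 starts with Field1 and v2 starts with Field2."""
    # Seek predicate: Field1 LIKE left(@V1, 4) + '%'
    candidates = [(f1, f2) for (f1, f2) in rows if f1.startswith(v1[:4])]
    # Residual predicates: @V1 LIKE Field1 + '%' AND @V2 LIKE Field2 + '%'
    return [(f1, f2) for (f1, f2) in candidates
            if v1.startswith(f1) and v2.startswith(f2)]

rows = [("0712334", "072342344"),
        ("06344534", "083453454"),
        ("06344534", "0845645565")]

print(best_matches(rows, "0634453445645", "08345345456756"))
# -> [('06344534', '083453454')]
```

The point of the extra `left(@V1, 4)` condition in the SQL is that it is sargable: the optimizer can seek the index to the narrow range of Field1 values beginning with those four characters, instead of scanning the whole table, while the residual predicates finish the job on that small candidate set.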

Answer 2:
There is no performance tip. Simple as that.
%something% means a table scan; indexes are not used because of the leading %. Full-text indexing won't work either, as you are not searching for full text but for part of a word.
Getting a faster machine to handle the table scans, and denormalizing, is about all you can do. 5-10 million rows should be fast enough on a decent computer. A memory-based table is not needed; you just need enough RAM to cache the table.
And that is pretty much it. Either find a way to get rid of the leading % or get hardware (mostly memory) fast enough to handle this.
OR handle it outside SQL Server: load the 5-10 million rows into a search service and use a better data structure. SQL, being generic, has to make compromises. But again, the partial match will kill pretty much any approach.
Answer 3:
Postgres has trigram indexes: http://www.postgresql.org/docs/current/interactive/pgtrgm.html
Maybe SQL Server has something like that?
Answer 4:
What is the shortest length of the values in columns Field1 and Field2? Call this number N.
Then create a select statement which asks for all prefixes of each variable, from length N up to the variable's full length. For example, say N = 10:
select distinct * from myTable
where Field1 in ('0634453445', '06344534456', '063445344564', '0634453445645')
  and Field2 in ('0834534545', '08345345456', '083453454567', '0834534545675', '08345345456756')
Write a small script which creates the query for you. Of course there is much more to optimize, but that requires (imho) changes to the structure of your table, and I can imagine that is something you don't want. At least you can give this approach a quick try.
Also, include the query plan when you try this approach in SSMS. The query plan will give you a good hint on how to organize your indexes.
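The query-generating script mentioned above could look like this (a sketch; the table and column names are taken from the example query, and the values are assumed to be digits only, so no quoting/escaping beyond the literal quotes is handled):

```python
def build_query(v1, v2, n=10):
    """Build the prefix IN-list query for two variables, assuming
    digit-only values and a minimum stored length of n."""
    # Expand each variable into every prefix from length n up to its
    # full length, quoted for the SQL IN list.
    in1 = ", ".join("'%s'" % v1[:i] for i in range(n, len(v1) + 1))
    in2 = ", ".join("'%s'" % v2[:i] for i in range(n, len(v2) + 1))
    return ("select distinct * from myTable\n"
            "where Field1 in (%s)\n"
            "and Field2 in (%s)" % (in1, in2))

print(build_query("0634453445645", "08345345456756"))
```

Run with the two variables from the question, this emits exactly the IN lists shown in the example above.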
Source: https://stackoverflow.com/questions/22165269/sql-server-performance-tips-for-like