How to find strings which are similar to a given string in SQL Server?

Submitted by 本秂侑毒 on 2019-12-12 17:16:21

Question


I have a SQL Server table which contains several string columns. I need to write an application that takes a string and searches for similar strings in that table.

For example, if I give "مختار" or "مختر" as the input string, I should get these rows from the table:

1 - مختاری
2 - شهاب مختاری
3 - شهاب الدین مختاری

I've searched the net for a solution but found nothing useful. I've read this question, but it does not help me because:

  1. I am using MS SQL Server, not MySQL.
  2. My table contents are in Persian, so I can't use Levenshtein distance and similar methods.
  3. I prefer a SQL Server-only solution, not an indexing or daemon-based solution.

The best solution would be one that lets us sort the results by similarity, but that is optional.

Do you have any suggestions?

Thanks


Answer 1:


Hmm... considering that you read the other post, you probably know about the LIKE operator already. Maybe your problem is "getting the string and searching for something similar"?

-- This part finds the string you want to search for

-- nvarchar (not varchar) so Persian text is kept as Unicode
DECLARE @MyString nvarchar(max);

SET @MyString = (SELECT [column] FROM [table]
WHERE **LOGIC TO FIND THE STRING GOES HERE**);


-- This part searches for that string; a smaller length difference ranks first

SELECT searchColumn, ABS(LEN(searchColumn) - LEN(@MyString)) AS Similarity
FROM [table] WHERE searchColumn LIKE N'%' + @MyString + N'%'
ORDER BY Similarity, searchColumn;

The similarity part is something like the thing you posted. Strings that are "more similar", meaning their length is closer to that of the search string, appear higher in the query results. The ABS call can in principle be dropped, because a value that contains @MyString is normally not shorter than it, but I kept it just in case.

Hope that helps =-)




Answer 2:


MSSQL supports LIKE which seems like it should work. Is there a reason it's not suitable for your program?

SELECT * FROM [table] WHERE input LIKE N'%مختار%'



Answer 3:


Besides the LIKE operator, you can use the condition WHERE CHARINDEX(search, columnname) > 0 (CHARINDEX is SQL Server's counterpart to the instr function found in other databases); however, this is generally slower. It returns the starting position of one string within another. Thus, searching in ABCDEFG for CD would return 3; 3 > 0, so the record would be returned. However, in the case you've described, LIKE seems to be the best solution.




Answer 4:


The general problem is that in languages where the same letter is written differently at the beginning, in the middle, and at the end of a word, those forms can have different character codes. We can try to use Persian-specific collations, but in general this will not help.

The second option is to use SQL Server's full-text search (FTS), but again, if it has no dedicated module for the language, it is much less useful.

The most general way is to do your own language processing, which is a very complex task. The following keywords, plus Google, can help you understand the size of the problem: DLP, words and terms, bigrams, n-grams, grammar and morphological inflection.
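Of the keywords above, character n-grams are probably the easiest to experiment with. The following is only a minimal application-side sketch in Python, not something from the answer itself: the function names (bigrams, bigram_similarity, rank) and the choice of Jaccard overlap as the score are assumptions made for illustration. It compares strings by the two-character pieces they share, so it works on Persian text without any language-specific rules.

```python
def bigrams(text):
    """Return the set of overlapping two-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bigram_similarity(a, b):
    """Jaccard overlap of the bigram sets of a and b, from 0.0 to 1.0."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings shorter than two characters have no bigrams at all.
        return 1.0 if a == b else 0.0
    return len(x & y) / len(x | y)


def rank(query, candidates):
    """Sort candidates from most to least similar to query."""
    return sorted(candidates,
                  key=lambda c: bigram_similarity(query, c),
                  reverse=True)


# Hypothetical data: the three names from the question.
names = ["شهاب الدین مختاری", "شهاب مختاری", "مختاری"]
for name in rank("مختار", names):
    print(name, round(bigram_similarity("مختار", name), 2))
```

Because the score is computed per candidate row, this kind of ranking is usually applied to a short list that a LIKE query has already narrowed down, rather than to the whole table.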




Answer 5:


Try the built-in SOUNDEX() and DIFFERENCE() functions. I hope they work fine for Persian.

Look at the following reference: http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/

The Similarity() function described there (part of Master Data Services, MDS) helps you sort results by similarity, as you asked in your question. It can also use algorithms other than Levenshtein edit distance, depending on the value passed for @method:

0 The Levenshtein edit distance algorithm

1 The Jaccard similarity coefficient algorithm

2 A form of the Jaro-Winkler distance algorithm

3 Longest common subsequence algorithm
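To give a feel for the first algorithm in the list, here is a minimal Python sketch of Levenshtein edit distance. It is not the MDS implementation; the function names and the normalisation to a 0.0 to 1.0 score are assumptions made for illustration. The comparison is done one Unicode character at a time, so Persian strings are handled the same way as Latin ones.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, 0.0 when every
    character differs."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Both inputs from the question are one edit away from each other.
print(levenshtein("مختر", "مختار"))
print(similarity("مختار", "مختاری"))
```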



Source: https://stackoverflow.com/questions/8636911/how-to-find-strings-which-are-similar-to-given-string-in-sql-server
