levenshtein-distance

PHP - Finding number of matching words between two pieces of text?

我的未来我决定 submitted on 2019-12-02 11:21:05
I want to find the number of similar words between two texts. Example:

    $str1 = "the cat is on the roof";
    $str2 = "the mouse is on the roof";

The words the, is, on, the, roof appear in both $str1 and $str2, so the output would be the number 5, or 86% as a percentage. I tried the similar_text() function, but it does not work the way I want.

Easy: explode them and then use array_diff:

    $array_1 = explode(" ", $str1);
    $array_2 = explode(" ", $str2);
    $totalWords = count($array_1); // count only after $array_1 has been built
    $differenceCount = count(array_diff($array_1, $array_2));
    $differentPercent = $differenceCount / ($totalWords / 100);

@Edit: Edited code above to
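The answer above counts the words that differ; the question asks for the words that match. A minimal JavaScript sketch of the matching count follows, assuming whitespace-separated words and exact, case-sensitive comparison. The helper name countMatchingWords is hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: counts how many words of `a` also occur in `b`.
// Assumes words are separated by whitespace and compared case-sensitively.
function countMatchingWords(a, b) {
  const wordsA = a.trim().split(/\s+/);
  const wordsB = new Set(b.trim().split(/\s+/));
  // Every occurrence in `a` is counted, so a repeated word counts twice.
  const matches = wordsA.filter((word) => wordsB.has(word)).length;
  return { matches, percent: (matches / wordsA.length) * 100 };
}

const result = countMatchingWords(
  "the cat is on the roof",
  "the mouse is on the roof"
);
console.log(result.matches); // 5
```

By this word-count measure the example gives 5 matching words out of 6.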

Find near-duplicates of comma-separated lists using Levenshtein distance [duplicate]

时光毁灭记忆、已成空白 submitted on 2019-12-02 11:03:25
This question already has an answer here: Potential Duplicates Detection, with 3 Severity Level (1 answer)

This question is based on the answer to my question from yesterday. To solve my problem, Jean-François Corbett suggested a Levenshtein distance approach. I then found this code somewhere to get the Levenshtein distance as a percentage:

    Public Function GetLevenshteinPercentMatch( _
        ByVal string1 As String, ByVal string2 As String, _
        Optional Normalised As Boolean = False) As Single
    Dim iLen As Integer
    If Normalised = False Then
        string1 = UCase$(WorksheetFunction.Trim(string1))
        string2 = UCase$
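The VBA listing is cut off, so its exact formula is not visible here. As a rough illustration of the idea only, the JavaScript sketch below computes a plain Levenshtein distance and turns it into a match ratio, assuming the ratio is defined as 1 minus the distance divided by the longer length. That definition and the name percentMatch are assumptions, not taken from the quoted code.

```javascript
// Classic Levenshtein distance using two rolling rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed definition: 1 - distance / longer length, after trimming and upper-casing.
function percentMatch(s1, s2) {
  const a = s1.trim().toUpperCase();
  const b = s2.trim().toUpperCase();
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

console.log(levenshtein("kitten", "sitting")); // 3
```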

Modifying Levenshtein Distance for positional Bias

℡╲_俬逩灬. submitted on 2019-12-02 02:07:36
I am using the Levenshtein distance algorithm to compare a company name provided as user input against a database of known company names to find the closest match. By itself, the algorithm works okay, but I want to build in a bias so that the edit distance is considered lower if the initial parts of the strings match. For example, if the search criterion is "ABCD", then both "ABCD Co." and "XYX ABCD" have an identical edit distance. However, I want to add weight to the fact that the initial part of the first string matches the search criterion more closely than that of the second string. One way of doing
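One possible way to express such a bias, not necessarily the one the truncated text goes on to describe, is to scale the distance down by the length of the shared prefix, in the spirit of the Winkler prefix adjustment. The function biasedDistance and its parameters bonusPerChar and maxPrefix are hypothetical, and the default values are arbitrary.

```javascript
// Classic Levenshtein distance using two rolling rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Number of leading characters the two strings share.
function commonPrefixLength(a, b) {
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return n;
}

// Assumed bias: shrink the distance by bonusPerChar for each shared leading
// character, capped at maxPrefix characters. Defaults are illustrative only.
function biasedDistance(query, candidate, bonusPerChar = 0.1, maxPrefix = 4) {
  const prefix = Math.min(commonPrefixLength(query, candidate), maxPrefix);
  return levenshtein(query, candidate) * (1 - bonusPerChar * prefix);
}

// Both candidates have plain distance 4; the biased score ranks "ABCD Co." first.
console.log(biasedDistance("ABCD", "ABCD Co.") < biasedDistance("ABCD", "XYX ABCD")); // true
```

Scaling rather than subtracting keeps the score non-negative as long as bonusPerChar times maxPrefix stays below 1.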

Sphinx and “did you mean … ?” suggestions idea. Will it work?

落花浮王杯 submitted on 2019-12-01 17:47:35
I'm trying to come up with the fastest way to make search suggestions. At first I thought a Levenshtein UDF combined with a MySQL table would do the job. But with Levenshtein, MySQL would have to go over every row in the table (tons of words), which would make the query really slow. I recently installed and started to use Sphinx (http://sphinxsearch.com/) for full-text searching, mainly because of its performance and tight MySQL integration with SphinxSE. So I asked myself if I can implement a "did you mean" algorithm using Sphinx to boost performance somehow, and I think I found a
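The excerpt ends before the idea is described. One common way to avoid scanning every row is to index each dictionary word by its character trigrams and only score words that share trigrams with the query. The JavaScript sketch below shows that general technique in memory; it is an assumption about the kind of approach meant, does not use Sphinx, and all function names are hypothetical.

```javascript
// Split a word into overlapping 3-character pieces, padded at both ends.
function trigrams(word) {
  const padded = `__${word.toLowerCase()}__`;
  const grams = [];
  for (let i = 0; i + 3 <= padded.length; i++) {
    grams.push(padded.slice(i, i + 3));
  }
  return grams;
}

// Map each trigram to the set of dictionary words containing it.
function buildIndex(dictionary) {
  const index = new Map();
  for (const word of dictionary) {
    for (const gram of new Set(trigrams(word))) {
      if (!index.has(gram)) index.set(gram, new Set());
      index.get(gram).add(word);
    }
  }
  return index;
}

// Rank candidate words by how many distinct trigrams they share with the query.
function suggest(index, query, limit = 3) {
  const scores = new Map();
  for (const gram of new Set(trigrams(query))) {
    for (const word of index.get(gram) ?? []) {
      scores.set(word, (scores.get(word) ?? 0) + 1);
    }
  }
  return [...scores.entries()]
    .sort((x, y) => y[1] - x[1])
    .slice(0, limit)
    .map(([word]) => word);
}

const index = buildIndex(["levenshtein", "sphinx", "search"]);
console.log(suggest(index, "levenstein")); // [ 'levenshtein' ]
```

A Levenshtein check could then be run on the few surviving candidates instead of on the whole table.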

mySQL - matching latin (english) form input to utf8 (non-English) data

孤者浪人 submitted on 2019-12-01 12:54:09
I maintain a music database in MySQL. How do I return results stored under e.g. 'Tiësto' when people search for 'Tiesto'? All the data is stored under full-text indexing, if that makes any difference. I'm already employing a combination of Levenshtein in PHP and REGEXP in SQL, not to solve this problem, but for increased searchability in general.

PHP:

    function Levenshtein($word) {
        $words = array();
        for ($i = 0; $i < strlen($word); $i++) {
            $words[] = substr($word, 0, $i) . '_' . substr($word, $i);
            $words[] = substr($word, 0, $i) . substr($word, $i + 1);
            $words[] = substr($word,
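On the application side, one way to make 'Tiesto' and 'Tiësto' comparable is to fold accents before comparing: decompose the string with Unicode NFD normalization and drop the combining marks. The JavaScript sketch below assumes the comparison happens outside the database, and the name foldAccents is hypothetical. It only handles accents that decompose into a base letter plus a combining mark, so letters such as 'ø' are left unchanged.

```javascript
// Strip diacritics: NFD splits "ë" into "e" plus a combining diaeresis,
// and the regex removes characters in the combining marks block U+0300..U+036F.
function foldAccents(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Compare two strings ignoring accents and letter case.
function looselyEqual(a, b) {
  return foldAccents(a).toLowerCase() === foldAccents(b).toLowerCase();
}

console.log(foldAccents("Tiësto")); // Tiesto
console.log(looselyEqual("tiesto", "Tiësto")); // true
```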

Damerau-Levenshtein distance Implementation

£可爱£侵袭症+ submitted on 2019-12-01 10:29:18
I'm trying to create a Damerau-Levenshtein distance function in JS. I've found a description of the algorithm on Wikipedia, but there is no implementation of it. It says: To devise a proper algorithm to calculate unrestricted Damerau–Levenshtein distance note that there always exists an optimal sequence of edit operations, where once-transposed letters are never modified afterwards. Thus, we need to consider only two symmetric ways of modifying a substring more than once: (1) transpose letters and insert an arbitrary number of characters between them, or (2) delete a sequence of characters
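A sketch of the unrestricted variant in JavaScript follows, based on the usual dynamic-programming formulation with an extra border row and column and a record of the last row in which each character was seen. It is offered as an illustration under those assumptions rather than a reference implementation. Variable names are my own, and the table is shifted by one so that index -1 of the textbook formulation maps to index 0 here.

```javascript
// Unrestricted Damerau-Levenshtein distance (insert, delete, substitute,
// transpose adjacent characters, with further edits allowed around a transposition).
function damerauLevenshtein(a, b) {
  const la = a.length;
  const lb = b.length;
  const maxDist = la + lb;
  const lastRow = new Map(); // last row (1-based) where each character of `a` was seen
  // d[i + 1][j + 1] holds the distance between a[0..i) and b[0..j).
  const d = Array.from({ length: la + 2 }, () => new Array(lb + 2).fill(0));
  d[0][0] = maxDist;
  for (let i = 0; i <= la; i++) {
    d[i + 1][0] = maxDist;
    d[i + 1][1] = i;
  }
  for (let j = 0; j <= lb; j++) {
    d[0][j + 1] = maxDist;
    d[1][j + 1] = j;
  }
  for (let i = 1; i <= la; i++) {
    let lastMatchCol = 0; // last column in this row where the characters matched
    for (let j = 1; j <= lb; j++) {
      const k = lastRow.get(b[j - 1]) ?? 0;
      const l = lastMatchCol;
      let cost = 1;
      if (a[i - 1] === b[j - 1]) {
        cost = 0;
        lastMatchCol = j;
      }
      d[i + 1][j + 1] = Math.min(
        d[i][j] + cost, // substitution or match
        d[i + 1][j] + 1, // insertion
        d[i][j + 1] + 1, // deletion
        d[k][l] + (i - k - 1) + 1 + (j - l - 1) // transposition
      );
    }
    lastRow.set(a[i - 1], i);
  }
  return d[la + 1][lb + 1];
}

console.log(damerauLevenshtein("ca", "abc")); // 2
console.log(damerauLevenshtein("ab", "ba")); // 1
```

The pair "ca" and "abc" is a useful check: the restricted (optimal string alignment) variant gives 3 for it, while the unrestricted distance is 2.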

levenshtein alternative

坚强是说给别人听的谎言 submitted on 2019-12-01 03:07:42
Question: I have a big set of queries and use Levenshtein to catch typos, but Levenshtein now causes MySQL to use all the CPU time. My query is a full-text search plus Levenshtein in a UNION statement. sql1 is my current query; sql2 is the full-text search only, which is fast and doesn't use too much CPU time; the last one is the Levenshtein one, which makes the CPU peak! Does any of you have an alternative way to catch typos as well? Please don't answer "normalize the data". I have thought of that, but it is not applicable to my data, as I
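One general way to cut the cost of typo matching, independent of the SQL in the question (which is not shown), is to stop computing as soon as a candidate cannot be within the allowed number of edits. The JavaScript sketch below skips candidates whose length differs too much and abandons the table once a whole row exceeds the threshold. The name withinDistance is hypothetical, and running the check in application code rather than in MySQL is an assumption.

```javascript
// Returns true if the Levenshtein distance between a and b is at most maxDist,
// giving up early whenever that is already impossible.
function withinDistance(a, b, maxDist) {
  // The distance is never smaller than the difference in length.
  if (Math.abs(a.length - b.length) > maxDist) return false;
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    let rowMin = curr[0];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
      rowMin = Math.min(rowMin, curr[j]);
    }
    // Row minima never decrease, so the final value cannot drop below rowMin.
    if (rowMin > maxDist) return false;
    prev = curr;
  }
  return prev[b.length] <= maxDist;
}

console.log(withinDistance("levenshtein", "levenstein", 1)); // true
console.log(withinDistance("levenshtein", "sphinx", 2)); // false
```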

How to determine character similarity?

ぃ、小莉子 submitted on 2019-11-30 23:45:18
I am using the Levenshtein distance to find similar strings after OCR. However, for some strings the edit distance is the same, although the visual appearance is obviously different. For example, the string Co will return these matches: CY (1), CZ (1), Ca (1). Considering that Co is the result from an OCR engine, Ca would be a more likely match than the others. Therefore, after calculating the Levenshtein distance, I'd like to refine the query result by ordering by visual similarity. In order to calculate this similarity, I'd like to use a standard sans-serif font, like Arial. Is there a library I can
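Short of rendering glyphs in a font, one way to approximate this ordering is a weighted edit distance in which substituting visually similar characters costs less than 1. The JavaScript sketch below assumes a hand-written confusion table. The pairs and costs in VISUAL_COST are made up for illustration and are not measured from Arial or from any OCR engine.

```javascript
// Hypothetical confusion costs between look-alike characters (illustrative values only).
const VISUAL_COST = new Map([
  ["o|a", 0.3],
  ["o|e", 0.4],
  ["l|1", 0.2],
  ["O|0", 0.1],
]);

// Cost of replacing x with y: 0 if equal, the table value if listed, otherwise 1.
function substitutionCost(x, y) {
  if (x === y) return 0;
  return VISUAL_COST.get(`${x}|${y}`) ?? VISUAL_COST.get(`${y}|${x}`) ?? 1;
}

// Levenshtein distance with per-pair substitution costs.
function visualDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + substitutionCost(a[i - 1], b[j - 1])
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// With the table above, Ca sorts ahead of CY and CZ as a match for Co.
const ranked = ["CY", "CZ", "Ca"].sort(
  (x, y) => visualDistance("Co", x) - visualDistance("Co", y)
);
console.log(ranked[0]); // Ca
```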

Damerau–Levenshtein distance algorithm in MySQL as a function

☆樱花仙子☆ submitted on 2019-11-30 19:59:35
Question: Does anyone know of a MySQL implementation of the Damerau–Levenshtein distance algorithm as a stored procedure/function that takes a single specified string as a parameter and looks for fuzzy matches of the string in a particular field within a particular table? I have found various procedure/function code examples that compare two specified strings and work out the distance, but firstly these use only the Levenshtein distance algorithm, not the Damerau–Levenshtein one, and secondly, I'm

How to determine character similarity?

谁说我不能喝 submitted on 2019-11-30 18:29:14
Question: I am using the Levenshtein distance to find similar strings after OCR. However, for some strings the edit distance is the same, although the visual appearance is obviously different. For example, the string Co will return these matches: CY (1), CZ (1), Ca (1). Considering that Co is the result from an OCR engine, Ca would be a more likely match than the others. Therefore, after calculating the Levenshtein distance, I'd like to refine the query result by ordering by visual similarity. In order to