JavaScript fuzzy search that makes sense

江枫思渺然 submitted on 2019-11-26 21:24:19

Good question! But my thought is that, rather than trying to modify Damerau-Levenshtein, you might be better off trying a different algorithm, or combining and weighting the results from two algorithms.

It strikes me that Damerau-Levenshtein gives no particular weight to exact or close matches on the "starting prefix", whereas your users apparently expect it to.

I searched for "better than Levenshtein" and, among other things, found this:

http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/

This mentions a number of "string distance" measures. Three that look particularly relevant to your requirement are:

  1. Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.

  2. q-gram distance: Sum of absolute differences between N-gram vectors of both strings.

  3. Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams.
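To make the last two definitions concrete, here is a minimal sketch of q-gram and Jaccard distance over character bigrams. The function names, the default `n = 2` and the lowercasing step are my own assumptions for illustration; they are not taken from the linked article.

```javascript
// Split a string into overlapping character n-grams (q-grams).
// Lowercasing is an assumption of this sketch, not part of the definitions.
function ngrams(text, n = 2) {
  const s = text.toLowerCase();
  const grams = [];
  for (let i = 0; i + n <= s.length; i++) {
    grams.push(s.slice(i, i + n));
  }
  return grams;
}

// Count how often each n-gram occurs: the "N-gram vector" of a string.
function ngramCounts(text, n = 2) {
  const counts = new Map();
  for (const gram of ngrams(text, n)) {
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// q-gram distance: sum of absolute differences between the two count vectors.
function qgramDistance(a, b, n = 2) {
  const ca = ngramCounts(a, n);
  const cb = ngramCounts(b, n);
  let distance = 0;
  for (const gram of new Set([...ca.keys(), ...cb.keys()])) {
    distance += Math.abs((ca.get(gram) || 0) - (cb.get(gram) || 0));
  }
  return distance;
}

// Jaccard distance: 1 - (shared n-grams / all observed n-grams), using sets.
function jaccardDistance(a, b, n = 2) {
  const sa = new Set(ngrams(a, n));
  const sb = new Set(ngrams(b, n));
  const all = new Set([...sa, ...sb]);
  if (all.size === 0) return 0; // both strings shorter than n: treat as identical
  let shared = 0;
  for (const gram of sa) {
    if (sb.has(gram)) shared++;
  }
  return 1 - shared / all.size;
}
```

With bigrams, `jaccardDistance('int', 'splint')` comes out lower than `jaccardDistance('int', 'international')`, because the longer word contributes many extra bigrams. So Jaccard on its own does not reward a prefix match either; some prefix-aware weighting would probably still be needed.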

Maybe you could use a weighted combination (or the minimum) of these metrics together with Levenshtein. Common substring, common N-gram and Jaccard will all strongly prefer similar strings. Or perhaps try using Jaccard alone?
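A weighted combination could look roughly like the sketch below. It blends a normalized Levenshtein distance with a bigram Jaccard distance. The function names and the 0.5/0.5 weights are placeholders I made up for illustration; they would need tuning against real queries.

```javascript
// Classic Levenshtein edit distance (insert, delete, substitute), two-row DP.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Set of character bigrams of a string.
function bigramSet(s) {
  const grams = new Set();
  for (let i = 0; i + 2 <= s.length; i++) grams.add(s.slice(i, i + 2));
  return grams;
}

// Jaccard distance over bigram sets, in [0, 1].
function jaccardDistance(a, b) {
  const sa = bigramSet(a);
  const sb = bigramSet(b);
  const union = new Set([...sa, ...sb]);
  if (union.size === 0) return a === b ? 0 : 1;
  let shared = 0;
  for (const g of sa) if (sb.has(g)) shared++;
  return 1 - shared / union.size;
}

// Weighted blend of the two distances; lower means more similar.
// The 0.5/0.5 weights are placeholders to tune, not recommended values.
function combinedDistance(query, candidate, weights = { edit: 0.5, jaccard: 0.5 }) {
  const q = query.toLowerCase();
  const c = candidate.toLowerCase();
  const longest = Math.max(q.length, c.length) || 1;
  const edit = levenshtein(q, c) / longest; // normalize to [0, 1]
  return weights.edit * edit + weights.jaccard * jaccardDistance(q, c);
}

// Return a copy of `candidates` sorted from most to least similar.
function rank(query, candidates) {
  return [...candidates].sort(
    (x, y) => combinedDistance(query, x) - combinedDistance(query, y)
  );
}
```

Swapping the weighted sum for `Math.min(edit, jaccard)` would give the "minimum" variant instead.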

Depending on the size of your list or database, these algorithms can be moderately expensive. For a fuzzy search I implemented, I used a configurable number of N-grams as "retrieval keys" to fetch candidates from the DB, then ran the expensive string-distance measure to sort them in preference order.
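The two-stage idea can be sketched as follows. This is not the implementation described above: an in-memory `Map` stands in for the database, plain Levenshtein stands in for the expensive measure, and the names `buildIndex`, `search` and `maxKeys` are hypothetical.

```javascript
// Character n-grams of a lowercased string.
function ngrams(text, n = 3) {
  const s = text.toLowerCase();
  const grams = [];
  for (let i = 0; i + n <= s.length; i++) grams.push(s.slice(i, i + n));
  return grams;
}

// Build an inverted index: n-gram -> set of positions in `items`.
// An in-memory Map stands in for the database table here.
function buildIndex(items, n = 3) {
  const index = new Map();
  items.forEach((item, id) => {
    for (const gram of ngrams(item, n)) {
      if (!index.has(gram)) index.set(gram, new Set());
      index.get(gram).add(id);
    }
  });
  return index;
}

// Levenshtein distance, used here as the "expensive" ranking measure.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Stage 1: collect candidates through at most `maxKeys` query n-grams.
// Stage 2: sort only those candidates by edit distance.
function search(query, items, index, { n = 3, maxKeys = 4 } = {}) {
  const ids = new Set();
  for (const gram of ngrams(query, n).slice(0, maxKeys)) {
    for (const id of index.get(gram) || []) ids.add(id);
  }
  const q = query.toLowerCase();
  return [...ids]
    .map((id) => items[id])
    .sort((x, y) => levenshtein(q, x.toLowerCase()) - levenshtein(q, y.toLowerCase()));
}
```

In this sketch a query shorter than `n` produces no keys and therefore no results, so a real system would need a fallback for very short queries.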

I wrote some notes on Fuzzy String Search in SQL. See:

I tried existing fuzzy libraries like fuse.js and found them to be terrible, so I wrote one that behaves basically like Sublime Text's search: https://github.com/farzher/fuzzysort

The only typo it allows is a transpose. It's pretty solid (1k stars, 0 issues), very fast, and handles your case easily:

fuzzysort.go('int', ['international', 'splint', 'tinder'])
// [{highlighted: '*int*ernational', score: 10}, {highlighted: 'spl*int*', score: 3003}]

Here is a technique I have used a few times. It gives pretty good results, though it does not do everything you asked for. It can also be expensive if the list is massive.

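# Split a string into overlapping two-character slices (bigrams).
get_bigrams = (string) ->
    s = string.toLowerCase()
    v = new Array(s.length - 1)
    # Exclusive range: an inclusive [0..v.length] would append a stray one-character slice.
    for i in [0...v.length] by 1
        v[i] = s.slice(i, i + 2)
    return v

# Score two strings by how many bigrams they share.
string_similarity = (str1, str2) ->
    if str1.length > 0 and str2.length > 0
        pairs1 = get_bigrams(str1)
        pairs2 = get_bigrams(str2)
        # Total bigram count of both strings (not a set union, despite the name).
        union = pairs1.length + pairs2.length
        hit_count = 0
        for x in pairs1
            for y in pairs2
                if x is y
                    hit_count++
        if hit_count > 0
            return ((2.0 * hit_count) / union)
    return 0.0

Pass two strings to string_similarity, which returns a number from 0 (no shared bigrams) to 1.0 (identical strings) depending on how similar they are. Because every matching pair is counted, strings with repeated bigrams can score above 1.0. The usage example relies on Lo-Dash.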

Usage example:

query = 'jenny Jackson'
names = ['John Jackson', 'Jack Johnson', 'Jerry Smith', 'Jenny Smith']

results = []
for name in names
    relevance = string_similarity(query, name)
    obj = {name: name, relevance: relevance}
    results.push(obj)

# Keep the 10 most relevant names; _.take(array, n) returns the first n elements.
results = _.take(_.sortBy(results, 'relevance').reverse(), 10)

console.log results
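For readers who are not using CoffeeScript or Lo-Dash, here is a rough plain JavaScript port of the same bigram idea. It is a sketch rather than a line-for-line translation: it lets each bigram of the second string match at most once (the usual Dice coefficient), which is my own change so that the score stays within 0 and 1. The camelCase names are also mine.

```javascript
// Overlapping two-character slices of a lowercased string.
function getBigrams(text) {
  const s = text.toLowerCase();
  const pairs = [];
  for (let i = 0; i < s.length - 1; i++) pairs.push(s.slice(i, i + 2));
  return pairs;
}

// Dice coefficient over bigrams: 2 * matches / (total bigrams in both strings).
// Each bigram in the second string can be matched at most once, which keeps
// the score within [0, 1] even when bigrams repeat.
function stringSimilarity(str1, str2) {
  const pairs1 = getBigrams(str1);
  const pairs2 = getBigrams(str2);
  const total = pairs1.length + pairs2.length;
  if (total === 0) return 0;
  const remaining = new Map();
  for (const p of pairs2) remaining.set(p, (remaining.get(p) || 0) + 1);
  let hits = 0;
  for (const p of pairs1) {
    const left = remaining.get(p) || 0;
    if (left > 0) {
      hits++;
      remaining.set(p, left - 1);
    }
  }
  return (2 * hits) / total;
}

// Same usage as above, with native array methods instead of Lo-Dash.
const query = 'jenny Jackson';
const names = ['John Jackson', 'Jack Johnson', 'Jerry Smith', 'Jenny Smith'];
const results = names
  .map((name) => ({ name, relevance: stringSimilarity(query, name) }))
  .sort((a, b) => b.relevance - a.relevance)
  .slice(0, 10);
console.log(results);
```

The nested loop in the original is replaced by a `Map` lookup, so the comparison is linear in the number of bigrams rather than quadratic.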

Also, have a fiddle.

Make sure your console is open or you won't see anything :)

You may take a look at Atom's fuzzaldrin library: https://github.com/atom/fuzzaldrin/

It is available on npm, has a simple API, and worked OK for me.

> fuzzaldrin.filter(['international', 'splint', 'tinder'], 'int');
< ["international", "splint"]

(function (int) {
    $("input[id=input]")
        .on("input", {
        sort: int
    }, function (e) {
        var query = $(e.target).val();
        $.each(e.data.sort, function (index, value) {
            // Show the first entry that contains the 3-character query
            // and starts with the same letter.
            if (value.indexOf(query) != -1
                && value.charAt(0) === query.charAt(0)
                && query.length === 3) {
                $("output[for=input]").val(value);
                return false; // stop iterating once a match is found
            }
        });
        return false;
    });
}(["international", "splint", "tinder"]))

jsfiddle http://jsfiddle.net/guest271314/QP7z5/

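This is my short and compact function for fuzzy matching:

function fuzzyMatch(pattern, str) {
  // Escape regex metacharacters so each pattern character is matched literally.
  const chars = pattern.split('').map((c) => c.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  // Allow any run of characters between consecutive pattern characters.
  const re = new RegExp('.*' + chars.join('.*') + '.*');
  return re.test(str);
}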
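What the regex above tests is whether the pattern is a subsequence of the string: the same characters in the same order, with anything in between. A loop-based sketch of that check, for comparison, avoids building a regex at all. The name `isSubsequence` is my own, and for single-line strings it should agree with the regex version.

```javascript
// Returns true when every character of `pattern` occurs in `str` in the same
// order, with any number of other characters in between (a subsequence test).
function isSubsequence(pattern, str) {
  let next = 0; // index of the next pattern character still to be found
  for (let i = 0; i < str.length && next < pattern.length; i++) {
    if (str[i] === pattern[next]) next++;
  }
  return next === pattern.length;
}

// Filter a word list down to the entries that contain the pattern in order.
const matches = ['international', 'splint', 'tinder'].filter((word) =>
  isSubsequence('int', word)
);
console.log(matches); // [ 'international', 'splint' ]
```

Like the regex version, this only answers yes or no; it does not rank "international" ahead of "splint", so it works best as a cheap filter before a scoring step.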

Check out my Google Sheets add-on called Flookup and use this function:

Flookup (lookupValue, tableArray, lookupCol, indexNum, threshold, [rank]) 

The parameter details are:

  1. lookupValue: the value you're looking up
  2. tableArray: the table you want to search
  3. lookupCol: the column you want to search
  4. indexNum: the column you want data to be returned from
  5. threshold: the percentage similarity below which data shouldn't be returned
  6. rank: the nth best match (i.e. if the first match isn't to your liking)

This should satisfy your requirements, although I'm not sure about point number 2.

Find out more at the official website.
