Javascript fuzzy search that makes sense

I'm looking for a fuzzy search JavaScript library to filter an array. I've tried using fuzzyset.js and fuse.js, but the results are terrible (there are demos you can try on the linked pages).

After doing some reading on Levenshtein distance, it strikes me as a poor approximation of what users are looking for when they type. For those who don't know, the system calculates how many insertions, deletions, and substitutions are needed to make two strings match.

One obvious flaw, which is fixed in the Levenshtein-Demerau model, is that both blub and boob are considered equally similar to bulb (each requiring two substitutions). It is clear, however, that bulb is more similar to blub than boob is, and the model I just mentioned recognizes that by allowing for transpositions.

I want to use this in the context of text completion, so if I have an array ['international', 'splint', 'tinder'], and my query is int, I think international ought to rank more highly than splint, even though the former has a score (higher=worse) of 10 versus the latter's 3.

So what I'm looking for (and will create if it doesn't exist), is a library that does the following:

Weights the different text manipulations
Weights each manipulation differently depending on where they appear in a word (early manipulations being more costly than late manipulations)
Returns a list of results sorted by relevance

Has anyone come across anything like this? I realize that StackOverflow isn't the place to be asking for software recommendations, but implicit (not anymore!) in the above is: am I thinking about this the right way?

Edit

I found a good paper (pdf) on the subject. Some notes and excerpts:

Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions

the Monger-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durban et al. 1998) with particular cost parameters

For the Smith-Waterman distance (wikipedia), "Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure." It's the n-gram approach.

A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings.

A variant of this due to Winkler (1999) also uses the length P of the longest common prefix

(seem to be intended primarily for short strings)

For text completion purposes, the Monger-Elkan and Jaro-Winkler approaches seem to make the most sense. Winkler's addition to the Jaro metric effectively weights the beginnings of words more heavily. And the affine aspect of Monger-Elkan means that the necessity to complete a word (which is simply a sequence of additions) won't disfavor it too heavily.

Conclusion:

the TFIDF ranking performed best among several token-based distance metrics, and a tuned affine-gap edit-distance metric proposed by Monge and Elkan performed best among several string edit-distance metrics. A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro and later extended by Winkler. This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster. One simple way of combining the TFIDF method and the Jaro-Winkler is to replace the exact token matches used in TFIDF with approximate token matches based on the Jaro- Winkler scheme. This combination performs slightly better than either Jaro-Winkler or TFIDF on average, and occasionally performs much better. It is also close in performance to a learned combination of several of the best metrics considered in this paper.

Good question! But my thought is that, rather than trying to modify Levenshtein-Demerau, you might be better to try a different algorithm or combine/ weight the results from two algorithms.

It strikes me that exact or close matches to the "starting prefix" are something Levenshtein-Demerau gives no particular weight to -- but your apparent user expectations would.

I searched for "better than Levenshtein" and, among other things, found this:

http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/

This mentions a number of "string distance" measures. Three which looked particularly relevant to your requirement, would be:

Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.
q-gram distance: Sum of absolute differences between N-gram vectors of both strings.
Jaccard distance: 1 minues the quotient of shared N-grams and all observed N-grams.

Maybe you could use a weighted combination (or minimum) of these metrics, with Levenshtein -- common substring, common N-gram or Jaccard will all strongly prefer similar strings -- or perhaps try just using Jaccard?

Depending on the size of your list/ database, these algorithms can be moderately expensive. For a fuzzy search I implemented, I used a configurable number of N-grams as "retrieval keys" from the DB then ran the expensive string-distance measure to sort them in preference order.

I wrote some notes on Fuzzy String Search in SQL. See:

http://literatejava.com/sql/fuzzy-string-search-sql/

I tried using existing fuzzy libraries like fuse.js and also found them to be terrible, so I wrote one which behaves basically like sublime's search. https://github.com/farzher/fuzzysort

The only typo it allows is a transpose. It's pretty solid (1k stars, 0 issues), very fast, and handles your case easily:

fuzzysort.go('int', ['international', 'splint', 'tinder']) // [{highlighted: '*int*ernational', score: 10}, {highlighted: 'spl*int*', socre: 3003}]

Here is a technique I have used a few times...It gives pretty good results. Does not do everything you asked for though. Also, this can be expensive if the list is massive.

get_bigrams = (string) ->     s = string.toLowerCase()     v = new Array(s.length - 1)     for i in [0..v.length] by 1         v[i] = s.slice(i, i + 2)     return v  string_similarity = (str1, str2) ->     if str1.length > 0 and str2.length > 0         pairs1 = get_bigrams(str1)         pairs2 = get_bigrams(str2)         union = pairs1.length + pairs2.length         hit_count = 0         for x in pairs1             for y in pairs2                 if x is y                     hit_count++         if hit_count > 0             return ((2.0 * hit_count) / union)     return 0.0

Pass two strings to string_similarity which will return a number between 0 and 1.0 depending on how similar they are. This example uses Lo-Dash

Usage Example....

query = 'jenny Jackson' names = ['John Jackson', 'Jack Johnson', 'Jerry Smith', 'Jenny Smith']  results = [] for name in names     relevance = string_similarity(query, name)     obj = {name: name, relevance: relevance}     results.push(obj)  results = _.first(_.sortBy(results, 'relevance').reverse(), 10)  console.log results

Also....have a fiddle

Make sure your console is open or you wont see anything :)

you may take a look at Atom's https://github.com/atom/fuzzaldrin/ lib.

it is available on npm, has simple API, and worked ok for me.

> fuzzaldrin.filter(['international', 'splint', 'tinder'], 'int'); < ["international", "splint"]

(function (int) {     $("input[id=input]")         .on("input", {         sort: int     }, function (e) {         $.each(e.data.sort, function (index, value) {           if ( value.indexOf($(e.target).val()) != -1                && value.charAt(0) === $(e.target).val().charAt(0)                && $(e.target).val().length === 3 ) {                 $("output[for=input]").val(value);           };           return false         });         return false     }); }(["international", "splint", "tinder"]))

jsfiddle http://jsfiddle.net/guest271314/QP7z5/

this is my short and compact function for fuzzy match:

function fuzzyMatch(pattern, str) {   pattern = '.*' + pattern.split('').join('.*') + '.*';   const re = new RegExp(pattern);   return re.test(str); }

Check out my Google Sheets add-on called Flookup and use this function:

Flookup (lookupValue, tableArray, lookupCol, indexNum, threshold, [rank])

The parameter details are:

lookupValue: the value you're looking up
tableArray: the table you want to search
lookupCol: the column you want to search
indexNum: the column you want data to be returned from
threshold: the percentage similarity below which data shouldn't be returned
rank: the nth best match (i.e. if the first match isn't to your liking)

This should do satisfy your requirements... although I'm not sure about point number 2.

Find out more at the official website.

来源：https://stackoverflow.com/questions/23305000/javascript-fuzzy-search-that-makes-sense

标签

javascript

regex

pattern-matching

string-matching

fuzzy-search