Google fuzzy search (a.k.a “suggestions”): What technique(s) are in use?

自闭症网瘾萝莉.ら 提交于 2019-12-03 03:18:13

问题


I'm implementing search suggestion functionality in my web-app, and have been looking at existing implementations for techniques in use.

It seems as though most of the major sites (Amazon, Bing, etc.) implement fuzzy search in the following way:

Tokenize search string in to terms
processingSearchStringSet = {}
For each term
    if exact term is NOT in index
        Get possible terms (fuzzyTerms) from levenshtein(term, 1 (or 2))
        For each term in fuzzyTerms
            if term is in index
                processingSearchStringSet.intersect(stringsIndexedByTermsSet)
    else
        processingSearchStringSet.intersect(stringsIndexedByTermsSet)

The result set members are then presumably ranked by metrics (ex: term order preserval, absolute term location, search popularity) and preserved or eliminated based on this ranking and a pre-determined result set size before being delivered back to the user.

Google's implementation on the other hand, differs quite a bit from this.

Specifically, it allows more than 1 error in the search string's constituent terms. The error threshhold seems to be dependant on where the term of interest is in the string, though it never exceeds 7.

What's interesting is that:

  1. Conducting a Levenstein search with a threshold of 5 on the entire term space, for each term in the user's string would be insanely expensive
  2. Even if #1 is what is done, it still wouldn't explain the absence of erroneous suggestions

N-grams also don't see to be in use: modifying a term so that it doesn't contain an bigram present in the original term does not seem to affect the result(s).

Here's an example to illustrate my findings:

Example term: "Fiftyyyy shades of grey"

Amazon suggestions: none 
(if the error count exceeds 1 on any term, the search fails)

Bing suggestions: none
(if the error count exceeds 2 on any term, the search fails)

Google suggestions: 10 (max) 
(breaking the search would require 5 or more errors on any single term, 
or multiple errors on multiple terms)

My question is: what type of sorcery is at work here? Are they just using a Levenshtein search with a huge error allowance, or do they use another technique I am unaware of?


回答1:


Maybe you should try this approach: http://norvig.com/spell-correct.html



来源:https://stackoverflow.com/questions/12239236/google-fuzzy-search-a-k-a-suggestions-what-techniques-are-in-use

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!