Fuzzy Text Search: Regex Wildcard Search Generator?

Submitted by 冷暖自知 on 2020-01-31 18:13:07

Question


I'm wondering if there is some way to do fuzzy string matching in PHP: looking for a word in a long string and finding a potential match even if it's misspelled, for example off by one character due to an OCR error.

I was thinking a regex generator might be able to do it. So given an input of "crazy" it would generate this regex:

.*((crazy)|(.+razy)|(c.+azy)|(cr.+zy)|(cra.+y)|(craz.+)).*

It would then return all matches for that word or variations of that word.

How to build the generator: I would probably split the search word into an array of characters and build the regex with a foreach over that array, replacing the letter at each position with ".+".

Is this a good way to do fuzzy text search, or is there a better way? What about some kind of string comparison that gives me a score based on how close it is? In short, I'm trying to see if some badly converted OCR text contains a word.


Answer 1:


String distance functions are useless when you don't know what the right word is. I'd suggest pspell functions:

$p = pspell_new("en");
print_r(pspell_suggest($p, "crazzy"));

http://www.php.net/manual/en/function.pspell-suggest.php




Answer 2:


echo generateRegex("crazy");
function generateRegex($word)
{
  $len = strlen($word);
  $regex = "\b((".$word.")";
  for($i = 0; $i < $len; $i++)
  {
    $temp = $word;
    $temp[$i] = '.'; // replace the character at position $i with a wildcard
    $regex .= "|(".$temp.")";
  }
  $regex = $regex.")\b";
  return $regex;
}
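
For anyone wanting to try a generated pattern against real text, here is a minimal usage sketch. It is a variant rather than the exact function above: the name buildFuzzyPattern and the sample text are made up for illustration, and it assumes you want `/` delimiters, case-insensitive matching, and preg_quote so that regex metacharacters in the word are matched literally.

```php
<?php
// Hypothetical helper: one alternative per position, with that character
// replaced by "." (any single character), plus the exact word itself.
function buildFuzzyPattern(string $word): string
{
    $alternatives = [preg_quote($word, '/')];
    $len = strlen($word);
    for ($i = 0; $i < $len; $i++) {
        // Quote the parts around position $i separately so "." stays a wildcard.
        $alternatives[] = preg_quote(substr($word, 0, $i), '/')
            . '.'
            . preg_quote(substr($word, $i + 1), '/');
    }
    return '/\b(?:' . implode('|', $alternatives) . ')\b/i';
}

// Sample text (made up) containing one OCR-style substitution.
$text = 'The OCR output says the plan was crazv, not crazy.';
preg_match_all(buildFuzzyPattern('crazy'), $text, $matches);
print_r($matches[0]);
```

Note that a pattern like this only covers single-character substitutions. A dropped or doubled letter (for example "crzy" or "craazy") will not match, which is where an edit-distance function is likely a better fit.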



Answer 3:


Levenshtein is one example of a string edit distance. There are different metrics for different purposes. Familiarize yourself with them and find the one that works for you.
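
As a rough sketch of how an edit distance could be applied to the OCR case, PHP's built-in levenshtein() can be run against each word of the text. The helper name findCloseWords, the default threshold of 1, and the sample text are assumptions for illustration. The split is byte-oriented and ASCII-only, so it would need adjusting for multibyte text.

```php
<?php
// Hypothetical helper: return the words of $text whose Levenshtein distance
// to $needle is at most $maxDistance, as word => distance.
function findCloseWords(string $text, string $needle, int $maxDistance = 1): array
{
    $hits = [];
    // Split on anything that is not an ASCII letter or digit.
    $words = preg_split('/[^A-Za-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // Compare case-insensitively by lowercasing both sides.
        $distance = levenshtein(strtolower($word), strtolower($needle));
        if ($distance <= $maxDistance) {
            $hits[$word] = $distance;
        }
    }
    return $hits;
}

// Sample text (made up) with a substitution, an insertion and a deletion.
$text = 'The OCR output says the plan was crazv, craazy, or just crzy.';
print_r(findCloseWords($text, 'crazy'));
```

Unlike a one-wildcard-per-position regex, this also catches inserted and dropped characters, and the distance doubles as a closeness score. similar_text() is another built-in that reports a similarity percentage, if a different metric suits the data better.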



Source: https://stackoverflow.com/questions/1720660/fuzzy-text-search-regex-wildcard-search-generator
