Matching for case insensitive exact phrase with spaces

て烟熏妆下的殇ゞ submitted on 2019-12-01 20:11:06

It sounds like part 1 of your problem is already solved, so this answer focuses only on part 2. As I understand it, you are trying to determine whether a given input message contains every word in a list, in any order.

This can be done with a single regex and one preg_match call per message, but it becomes inefficient if you have a large list of words. If N is the number of words you are searching for and M is the length of the message, the work is roughly O(N*M): the regex contains one lookahead assertion per keyword, each with two .* terms, so the regex engine has to traverse the message once for each keyword. Here is the example code:

<?php

// sample messages
$msg1 = "Lose all the weight all the weight you want.  It's fast and easy!";
$msg2 = 'Are you over weight? lose the pounds fast!';
$msg3 = 'Lose weight slowly by working really hard!';

// spam defining keywords (all required, but any order).
$keywords = array('lose', 'weight', 'fast');

// build the regex pattern using the array of keywords
// (separator goes first: implode(array, separator) was removed in PHP 8.0)
$patt = '/(?=.*\b'. implode('\b.*)(?=.*\b', $keywords) . '\b.*)/is';

echo "The pattern is: '" .$patt. "'\n";
echo 'msg1 '. (preg_match($patt, $msg1) ? 'is' : 'is not') ." spam\n";
echo 'msg2 '. (preg_match($patt, $msg2) ? 'is' : 'is not') ." spam\n";
echo 'msg3 '. (preg_match($patt, $msg3) ? 'is' : 'is not') ." spam\n";
?>

The output is:

The pattern is: '/(?=.*\blose\b.*)(?=.*\bweight\b.*)(?=.*\bfast\b.*)/is'
msg1 is spam
msg2 is spam
msg3 is not spam

This second solution seems more complex because there is more code, but the regex is much simpler: it has no lookahead assertions and no .* terms. The preg_match function is called in a while loop, but each call resumes at the offset where the previous match ended, so each message is traversed only once and the complexity should be O(M). This could also be done with a single preg_match_all call, but then you would have to perform an array_search on the results to get the final count.

<?php

// sample messages
$msg1 = "Lose all the weight all the weight you want.  It's fast and easy!";
$msg2 = 'Are you over weight? lose the pounds fast!';
$msg3 = 'Lose weight slowly by working really hard!';

// spam defining keywords (all required, but any order).
$keywords = array('lose', 'weight', 'fast');

// build the regex pattern using the array of keywords
// (separator goes first: implode(array, separator) was removed in PHP 8.0)
$patt = '/(\b'. implode('\b|\b', $keywords) .'\b)/is';

echo "The pattern is: '" .$patt. "'\n";
echo 'msg1 '. (matchall($patt, $msg1, $keywords) ? 'is' : 'is not') ." spam\n";
echo 'msg2 '. (matchall($patt, $msg2, $keywords) ? 'is' : 'is not') ." spam\n";
echo 'msg3 '. (matchall($patt, $msg3, $keywords) ? 'is' : 'is not') ." spam\n";

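function matchall($patt, $msg, $keywords)
{
  $offset = 0;
  $matches = array();
  // one hit counter per keyword, all starting at zero
  $index = array_fill_keys($keywords, 0);
  // no '&' before $matches: call-time pass-by-reference is a fatal error since PHP 5.4
  while( preg_match($patt, $msg, $matches, PREG_OFFSET_CAPTURE, $offset) ) {
    // resume the search just past the keyword that was found
    $offset = $matches[1][1] + strlen($matches[1][0]);
    $index[strtolower($matches[1][0])] += 1;
  }
  // the smallest counter is 0 (falsy) if any keyword never appeared
  return min($index);
}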
?>

The output is:

The pattern is: '/(\blose\b|\bweight\b|\bfast\b)/is'
msg1 is spam
msg2 is spam
msg3 is not spam
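
For comparison, the preg_match_all variant mentioned above might look roughly like the sketch below. It is only an illustration, not part of the original answer: the function name matchall_once is made up, array_diff is swapped in for the array_search step, and each keyword is passed through preg_quote on the assumption that keywords may contain regex metacharacters.

```php
<?php

// Hypothetical helper (name and signature are assumptions for this sketch).
// Returns true only if every keyword occurs in $msg as a whole word.
function matchall_once(array $keywords, string $msg): bool
{
  // escape regex metacharacters and the '/' delimiter in each keyword
  $quoted = array_map(function ($kw) { return preg_quote($kw, '/'); }, $keywords);
  $patt = '/\b(?:' . implode('|', $quoted) . ')\b/i';

  // collect every keyword hit in one call; $matches[0] holds the matched text
  $matches = array();
  preg_match_all($patt, $msg, $matches);

  // normalise case, drop repeats, then look for keywords that never appeared
  $found  = array_unique(array_map('strtolower', $matches[0]));
  $wanted = array_map('strtolower', $keywords);
  return count(array_diff($wanted, $found)) === 0;
}

$keywords = array('lose', 'weight', 'fast');
$msgs = array(
  'msg1' => "Lose all the weight all the weight you want.  It's fast and easy!",
  'msg2' => 'Are you over weight? lose the pounds fast!',
  'msg3' => 'Lose weight slowly by working really hard!',
);
foreach ($msgs as $name => $msg) {
  echo $name . ' ' . (matchall_once($keywords, $msg) ? 'is' : 'is not') . " spam\n";
}
```

Because a space is an ordinary character in the pattern, a keyword such as 'lose weight' would be matched as an exact phrase here. Note that \b only behaves as expected when a keyword starts and ends with a word character.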