Matching a case-insensitive exact phrase with spaces

Say I have the string "Hello I went to the store today" and an array of phrases to match:

$perfectMatches = array("i went","store today");

It should match both of those. (The array can get quite large, so I'd prefer to do it in one preg_match.)

Edit: Got this one working! Thanks!

preg_match_all("/\b(" . implode("|", $perfectMatches) . ")\b/i", $string, $match1);

I also need a separate regular expression that is kind of hard to explain. Say I have an array

$array = array("birthday party","ice cream");//this can be very long

Is it possible to get a regular expression that will match a string if "birthday" and "party" both appear anywhere in the string?

So it should match "Hi, it's my birthday and I'm going to have a party". Can "ice cream" be handled the same way in the same single preg_match?

Thanks

Edit: Example...

A user submits a description of an item and I want to check for spam. I know that most spam posts have phrases like "personal checks" or "hot deal" so I want to get a list of all these phrases and check it with the description. If the description has any of the phrases in my list, it'll be marked as spam. This scenario applies to the first regular expression I want.

The second regular expression is for the case where I know that some spam posts contain the words "lose", "weight" and "fast" somewhere in the description. They can be in any order, but all 3 words must be present. So if I have a list of phrases such as "lose weight fast" and "credit card required" and check it against the description, I can mark the post as spam.

It sounds like part 1 of your problem is already solved, so this answer focuses only on part 2. As I understand it, you are trying to determine if a given input message contains all of a list of words in any order.

This can be done with a regex and a single preg_match per message, but it becomes inefficient when the list of words is large. If N is the number of words you are searching for and M is the length of the message, the cost should be roughly O(N*M): each keyword gets its own lookahead assertion containing two .* terms, so the regex engine has to traverse the message once per keyword. Here is the example code:

<?php

// sample messages
$msg1 = "Lose all the weight all the weight you want.  It's fast and easy!";
$msg2 = 'Are you over weight? lose the pounds fast!';
$msg3 = 'Lose weight slowly by working really hard!';

// spam defining keywords (all required, but any order).
$keywords = array('lose', 'weight', 'fast');

//build the regex pattern using the array of keywords
$patt = '/(?=.*\b'. implode('\b.*)(?=.*\b', $keywords) . '\b.*)/is';

echo "The pattern is: '" .$patt. "'\n";
echo 'msg1 '. (preg_match($patt, $msg1) ? 'is' : 'is not') ." spam\n";
echo 'msg2 '. (preg_match($patt, $msg2) ? 'is' : 'is not') ." spam\n";
echo 'msg3 '. (preg_match($patt, $msg3) ? 'is' : 'is not') ." spam\n";
?>

The output is:

The pattern is: '/(?=.*\blose\b.*)(?=.*\bweight\b.*)(?=.*\bfast\b.*)/is'
msg1 is spam
msg2 is spam
msg3 is not spam
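
The question also asks about checking several keyword groups ("lose weight fast", "credit card required") in one preg_match. A minimal sketch of how the lookahead pattern above might be extended to do that is shown here. The helper name buildAnyGroupPattern is hypothetical, and the sketch assumes every phrase contains at least one word. It also escapes each word with preg_quote and anchors the pattern at the start of the message so the lookaheads are only tried from one position.

```php
<?php
// Hypothetical helper (not part of the original answer): builds one pattern
// that matches when ALL words of ANY one phrase occur in the message.
function buildAnyGroupPattern(array $phrases)
{
    $groups = array();
    foreach ($phrases as $phrase) {
        // Split the phrase into words; each word becomes one lookahead.
        $words = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
        $lookaheads = '';
        foreach ($words as $word) {
            // preg_quote keeps regex metacharacters in a word literal.
            $lookaheads .= '(?=.*\b' . preg_quote($word, '/') . '\b)';
        }
        if ($lookaheads !== '') {
            $groups[] = $lookaheads;
        }
    }
    // Alternation between groups: any one complete group is enough.
    return '/^(?:' . implode('|', $groups) . ')/is';
}

$patt = buildAnyGroupPattern(array('lose weight fast', 'credit card required'));

var_dump(preg_match($patt, 'Are you over weight? lose the pounds fast!')); // int(1)
var_dump(preg_match($patt, 'No credit check, but a card is required.'));   // int(1)
var_dump(preg_match($patt, 'Lose weight slowly by working really hard!')); // int(0)
```

The cost still grows with the total number of words across all phrases, so the efficiency concern described above applies here as well.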

A second solution looks more complex because there is more code, but its regex is much simpler: it has no lookahead assertions and no .* terms. preg_match is called in a while loop, resuming each time from the end of the previous match, so each message is traversed only once and the complexity should be O(M). This could also be done with a single preg_match_all call, but you would then have to search the resulting array (for example with array_search) for each keyword to get the final count.

<?php

// sample messages
$msg1 = "Lose all the weight all the weight you want.  It's fast and easy!";
$msg2 = 'Are you over weight? lose the pounds fast!';
$msg3 = 'Lose weight slowly by working really hard!';

// spam defining keywords (all required, but any order).
$keywords = array('lose', 'weight', 'fast');

//build the regex pattern using the array of keywords
$patt = '/(\b'. implode('\b|\b', $keywords) .'\b)/is';

echo "The pattern is: '" .$patt. "'\n";
echo 'msg1 '. (matchall($patt, $msg1, $keywords) ? 'is' : 'is not') ." spam\n";
echo 'msg2 '. (matchall($patt, $msg2, $keywords) ? 'is' : 'is not') ." spam\n";
echo 'msg3 '. (matchall($patt, $msg3, $keywords) ? 'is' : 'is not') ." spam\n";

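// returns true only if every keyword occurs at least once in $msg
function matchall($patt, $msg, $keywords)
{
  $offset = 0;
  $matches = array();
  // per-keyword hit counter, keyed by the (lowercase) keyword
  $index = array_fill_keys($keywords, 0);
  while( preg_match($patt, $msg, $matches, PREG_OFFSET_CAPTURE, $offset) ) {
    // resume scanning right after the current match
    $offset = $matches[1][1] + strlen($matches[1][0]);
    $index[strtolower($matches[1][0])] += 1;
  }
  // the smallest count is 0 when any keyword is missing
  return min($index) > 0;
}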
?>

The output is:

The pattern is: '/(\blose\b|\bweight\b|\bfast\b)/is'
msg1 is spam
msg2 is spam
msg3 is not spam
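
The single preg_match_all variant mentioned above could be sketched roughly as follows. This is an illustration rather than the original author's code. The function name matchAllOnce is an assumption, and it uses array_diff instead of a per-keyword array_search to check that no keyword is missing.

```php
<?php
// Hypothetical variant of matchall(): one preg_match_all call per message.
function matchAllOnce(array $keywords, $msg)
{
    // Escape each keyword so regex metacharacters are matched literally.
    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $keywords);
    $patt = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    // One pass over the message collects every keyword occurrence.
    preg_match_all($patt, $msg, $matches);

    // Compare case-insensitively by lowercasing both sides.
    $found  = array_map('strtolower', $matches[0]);
    $wanted = array_map('strtolower', $keywords);

    // Spam only if no wanted keyword is absent from the found list.
    return count(array_diff($wanted, $found)) === 0;
}

$keywords = array('lose', 'weight', 'fast');
var_dump(matchAllOnce($keywords, 'Are you over weight? lose the pounds fast!')); // bool(true)
var_dump(matchAllOnce($keywords, 'Lose weight slowly by working really hard!')); // bool(false)
```

Unlike the while loop, this version does not keep a count per keyword; it only reports whether every keyword appeared at least once.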

Source: https://stackoverflow.com/questions/15124922/matching-for-case-insensitive-exact-phrase-with-spaces

Tags: php, regex, preg-match