I'm working on a WordPress plugin that replaces the bad words from the comments with random new ones from a list.
I now have 2 arrays: one containing the bad words
There are (as has been pointed out in the comments numerous times) gaping holes for you - and/or your code - to fall into through implementing such a feature, to name but a few:
You'd do better to implement a moderation/flagging system where people can flag offensive comments which can then be edited/removed by mods, users, etc.
On that understanding, let us proceed...
Given that you have a list of bad words ($bad_words) and a list of replacement words ($good_words), you can very easily use PHP's preg_replace_callback function:
$input_string = 'This Could be interesting but should it be? Perhaps this \'would\' work; or couldn\'t it?';
$bad_words = array('could', 'would', 'should');
$good_words = array('might', 'will');
function replace_words($matches){
global $good_words;
return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
}
echo preg_replace_callback('/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i', 'replace_words', $input_string);
Okay, so what the code does is build a regex pattern consisting of all of the bad words and hand it to preg_replace_callback. Matches will then be in the format:
/(START OR WORD_BOUNDARY OR WHITE_SPACE)(BAD_WORD)(WORD_BOUNDARY OR WHITE_SPACE OR END)/i
The i modifier makes it case insensitive so both bad and Bad would match.
The function replace_words then takes the matched word and its boundaries (either blank or a white space character) and replaces it with the boundaries and a random good word.
global $good_words; <-- Makes the $good_words variable accessible from within the function
$matches[1] <-- The word boundary before the matched word
$matches[3] <-- The word boundary after the matched word
$good_words[rand(0, count($good_words)-1)] <-- Selects a random good word from $good_words
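As an aside, the random pick can also be written with array_rand, which returns a random key of the array. This is a swapped-in alternative to the rand(0, count(...)-1) expression above, not what the original code uses; a minimal sketch with the same $good_words list:

```php
<?php
// Replacement list from the example above.
$good_words = array('might', 'will');

// array_rand() returns one random key; index the array with it to get the word.
$random_word = $good_words[array_rand($good_words)];

echo $random_word, "\n";
```

Either form works; array_rand just saves you from the off-by-one bookkeeping.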
You could rewrite the above as a one-liner using an anonymous function in the preg_replace_callback call:
echo preg_replace_callback(
'/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
function ($matches) use ($good_words){
return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
},
$input_string
);
If you're going to use it multiple times you may also write it as a self-contained function. In that case you're most likely going to want to feed the good/bad words into the function when calling it (or hard-code them in there permanently), but that depends on how you derive them:
function clean_string($input_string, $bad_words, $good_words){
return preg_replace_callback(
'/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
function ($matches) use ($good_words){
return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
},
$input_string
);
}
echo clean_string($input_string, $bad_words, $good_words);
Running the above functions consecutively with the input and word lists shown in the first example gives:
This will be interesting but might it be? Perhaps this 'will' work; or couldn't it?
This might be interesting but might it be? Perhaps this 'might' work; or couldn't it?
This might be interesting but will it be? Perhaps this 'will' work; or couldn't it?
Of course the replacement words are chosen randomly so if I refreshed the page I'd get something else... But this shows what does/doesn't get replaced.
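If any entry in $bad_words contains characters that are special in a regex, escape them before building the pattern (passing '/' so the pattern delimiter is escaped as well):
foreach($bad_words as $key=>$word){
$bad_words[$key] = preg_quote($word, '/');
}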
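To see why the escaping matters, here is a small sketch comparing an unescaped pattern with an escaped one. The word list ('d.mn' and 'he/ck') is made up for illustration, and the pattern shape is the one used earlier in this answer:

```php
<?php
// Hypothetical word list: '.' and '/' are both special in the pattern below.
$bad_words = array('d.mn', 'he/ck');

// Unescaped: '.' matches any character, so 'damn' is caught by 'd.mn' too.
$unsafe = '/(^|\b|\s)(d.mn)(\b|\s|$)/i';

// Escaped: preg_quote() backslash-escapes '.' and, via its second argument, '/'.
$quoted = array_map(function ($word) {
    return preg_quote($word, '/');
}, $bad_words);
$safe = '/(^|\b|\s)(' . implode('|', $quoted) . ')(\b|\s|$)/i';

var_dump(preg_match($unsafe, 'well damn')); // int(1)
var_dump(preg_match($safe, 'well damn'));   // int(0)
var_dump(preg_match($safe, 'well d.mn'));   // int(1)
var_dump(preg_match($safe, 'oh he/ck'));    // int(1)
```

Without the second argument to preg_quote, a word containing / would end the pattern early and preg_match would fail with a warning.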
In this code I've used \b, \s, and ^ or $ as word boundaries, and there is a good reason for this. While white space, start of string, and end of string are all considered word boundaries, \b will not match in all cases, for example:
\b\$h1t\b <---Will not match
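This is because \b only matches at a position with a word character (i.e. [a-zA-Z0-9_]) on one side and a non-word character (or the start/end of the string) on the other. Characters like $ don't count as word characters, so there is no boundary between a preceding space (or the start of the string) and the $.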
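A quick way to check this yourself is the sketch below. The sample sentence is made up; the second pattern uses the same boundary alternatives as the code earlier in this answer:

```php
<?php
// Sample text; single quotes keep the $ literal in PHP.
$text = 'What a $h1t day';

// \b alone: the space and the $ are both non-word characters,
// so there is no boundary in front of the $ and the match fails.
$with_b_only = preg_match('/\b\$h1t\b/', $text);

// Also allowing start-of-string or white space before the word lets it match.
$with_alternatives = preg_match('/(^|\b|\s)\$h1t(\b|\s|$)/', $text);

var_dump($with_b_only);       // int(0)
var_dump($with_alternatives); // int(1)
```

Here the leading \s alternative consumes the space in front of the $, which is why the callback has to put $matches[1] back into the replacement.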
Depending on the size of your word list there are a couple of potential hiccups. From a system design perspective it's generally bad form to have huge regexes for a couple of reasons:
Given that the regex pattern is compiled by PHP the first reason is negated. The second should be negated as well; if your word list is large, with a dozen permutations of each bad word, then I suggest you stop and rethink your approach (read: use a flagging/moderation system).
To clarify, I don't see a problem with having a small word list to filter out specific expletives, as it serves a purpose: to stop users from having an outburst at one another. The problem comes when you try to filter out too much, including permutations. Stick to filtering common swear words and if that doesn't work then - for the last time - implement a flagging/moderation system.