I am working on this PHP function. The idea is to wrap certain words occuring in a string into certain tags (both, words and tags, given in an array). It works OK!, but when
Definitely use a dom parser to isolate the qualifying text nodes before attempting to replace with a regex pattern that respects: word boundries, case-insensitivity, and unicode characters. If you are planning to specifically target words with unicode characters, then you will need to add mb_ to some of the string functions.
After leveraging the following insights, I tailored a solution for your scenario.
Code: (Demo)
$html = <<fóo lórem
bár ipsum bar food foo bark. bar not á test
HTML;
$lookup = [
'foo' => 'h3',
'bar' => 'h2'
];
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$regexNeedles = [];
foreach ($lookup as $word => $tagName) {
$regexNeedles[] = preg_quote($word, '~');
}
$pattern = '~\b(' . implode('|', $regexNeedles) . ')\b~iu' ;
foreach($xpath->query('//*[not(self::a)]/text()') as $textNode) {
$newNodes = [];
$hasReplacement = false;
foreach (preg_split($pattern, $textNode->nodeValue, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) as $fragment) {
$fragmentLower = strtolower($fragment);
if (isset($lookup[$fragmentLower])) {
$hasReplacement = true;
$a = $dom->createElement($lookup[$fragmentLower]);
$a->nodeValue = $fragment;
$newNodes[] = $a;
} else {
$newNodes[] = $dom->createTextNode($fragment);
}
}
if ($hasReplacement) {
$newFragment = $dom->createDocumentFragment();
foreach ($newNodes as $newNode) {
$newFragment->appendChild($newNode);
}
$textNode->parentNode->replaceChild($newFragment, $textNode);
}
}
echo substr(trim(utf8_decode($dom->saveHTML($dom->documentElement))), 3, -4);
Output:
foo
fóo lórem
bár ipsum bar
food foo
bark. bar not á test