I am working on this PHP function. The idea is to wrap certain words occurring in a string in certain tags (both the words and the tags are given in an array). It works OK, but when
Definitely use a DOM parser to isolate the qualifying text nodes before attempting to replace with a regex pattern that respects word boundaries, case-insensitivity, and Unicode characters. If you are planning to specifically target words with Unicode characters, then you will need to add the mb_ prefix to some of the string functions (for example, mb_strtolower() instead of strtolower()).
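As a minimal sketch of why the mb_ variants matter, assume a lookup whose key contains a non-ASCII letter. The `$lookup` entry and the `normalizeKey()` helper below are hypothetical, made up for illustration, and the mbstring extension is assumed to be installed.

```php
<?php
// Hypothetical lookup with a non-ASCII key (not part of the original example).
$lookup = ['árbol' => 'h3'];

// Hypothetical helper: lower-case a matched fragment in a multibyte-safe way.
// strtolower() only handles ASCII letters reliably; mb_strtolower() is
// encoding-aware and also lower-cases letters such as "Á".
function normalizeKey(string $fragment): string
{
    return mb_strtolower($fragment, 'UTF-8');
}

$fragment = 'ÁRBOL';
var_dump(isset($lookup[normalizeKey($fragment)]));
```

With plain strtolower() the accented capital may be left untouched, so the array lookup would miss the key.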
After leveraging the following insights, I tailored a solution for your scenario.
Code:
$html = <<<HTML
foo <a href='http://test.com'>fóo</a> lórem
bár ipsum bar food foo bark. <a>bar</a> not á test
HTML;

// Map each lowercase word to the tag that should wrap it.
$lookup = [
    'foo' => 'h3',
    'bar' => 'h2'
];

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Build one alternation pattern from the escaped words.
$regexNeedles = [];
foreach ($lookup as $word => $tagName) {
    $regexNeedles[] = preg_quote($word, '~');
}
$pattern = '~\b(' . implode('|', $regexNeedles) . ')\b~iu';

// Visit every text node whose parent is not an <a> element.
foreach ($xpath->query('//*[not(self::a)]/text()') as $textNode) {
    $newNodes = [];
    $hasReplacement = false;
    // Split on the words, keeping each matched word as its own fragment.
    foreach (preg_split($pattern, $textNode->nodeValue, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) as $fragment) {
        $fragmentLower = strtolower($fragment);
        if (isset($lookup[$fragmentLower])) {
            $hasReplacement = true;
            $a = $dom->createElement($lookup[$fragmentLower]);
            $a->nodeValue = $fragment;
            $newNodes[] = $a;
        } else {
            $newNodes[] = $dom->createTextNode($fragment);
        }
    }
    // Only touch the DOM when at least one word was wrapped.
    if ($hasReplacement) {
        $newFragment = $dom->createDocumentFragment();
        foreach ($newNodes as $newNode) {
            $newFragment->appendChild($newNode);
        }
        $textNode->parentNode->replaceChild($newFragment, $textNode);
    }
}

// Strip the leading <p> and trailing </p> (3 and 4 characters) around the parsed markup.
// Note: utf8_decode() is deprecated as of PHP 8.2.
echo substr(trim(utf8_decode($dom->saveHTML($dom->documentElement))), 3, -4);
Output:
<h3>foo</h3> <a href="http://test.com">fóo</a> lórem
bár ipsum <h2>bar</h2> food <h3>foo</h3> bark. <a>bar</a> not á test
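As for the JS answer you pointed to, it's basically the same in PHP. You just have to write the pattern as a string, and drop the g flag: PHP has no such modifier, because preg_replace() and preg_match_all() already work on every match.
$regexp = "/(<pre>(?:[^<](?!\/pre))*<\/pre>)|(\:\-\))/i";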
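A minimal sketch of how such a pattern might be used from PHP, assuming the goal is to replace smileys everywhere except inside <pre> blocks. The sample input and the `[smile]` replacement text are made up for illustration, and the pattern carries only the `i` modifier since PHP does not accept `g`.

```php
<?php
$regexp = "/(<pre>(?:[^<](?!\/pre))*<\/pre>)|(\:\-\))/i";

// Hypothetical input: one smiley inside a <pre> block, two outside.
$input = 'Hi :-) <pre>code :-)</pre> bye :-)';

$output = preg_replace_callback($regexp, function (array $m) {
    // Group 1 matched: a whole <pre>...</pre> block, so return it untouched.
    if ($m[1] !== '') {
        return $m[1];
    }
    // Otherwise group 2 matched a smiley outside of any <pre> block.
    return '[smile]';
}, $input);

echo $output;
```

The callback is called for every match in the subject, which is why no "global" flag is needed.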
Also note that you may need a case-insensitive preg_replace (the i modifier) to replace the word 'empresarios' when it is capitalized (Empresarios) or written in mixed case (EmPreSAriOS).
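A short sketch of that case-insensitive replacement, assuming the word should be wrapped in <h2> and the original capitalization kept. The sample sentence is made up for illustration.

```php
<?php
// Hypothetical sample text containing three spellings of the word.
$text = 'Los empresarios, Empresarios y EmPreSAriOS.';

// "i" makes the match case-insensitive; "u" treats pattern and subject as UTF-8.
// $0 re-inserts the matched text, so the original capitalization is preserved.
$result = preg_replace('/\bempresarios\b/iu', '<h2>$0</h2>', $text);

echo $result;
```

Using `$0` in the replacement avoids having to run a separate preg_replace for each spelling.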
Also take care with your HTML. <h2> elements are block-level, so the wrapped word ends up on a line of its own. The original string:
string where the word empresarios should be replaced;
is rendered after the replacement as:
string where the word
empresarios
should be replaced;
Maybe what you need is an inline tag such as <big> (note that <big> is obsolete in HTML5; a <span> styled with CSS does the same job).
Use the DOM and only modify text nodes:
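$s = "foo <a href='http://test.com'>foo</a> lorem bar ipsum foo. <a>bar</a> not a test";
echo htmlentities($s) . '<hr>';

$d = new DOMDocument;
$d->loadHTML($s);
$x = new DOMXPath($d);
$t = $x->evaluate("//text()");

// Map each word to the tag that should wrap it.
$wrap = array(
    'foo' => 'h1',
    'bar' => 'h2'
);

// Escape the words so that regex metacharacters in them are matched literally.
$quoted = array_map(function ($word) {
    return preg_quote($word, '/');
}, array_keys($wrap));
$preg_find = '/\b(' . implode('|', $quoted) . ')\b/';

foreach ($t as $textNode) {
    // Leave link text untouched.
    if ($textNode->parentNode->tagName == "a") {
        continue;
    }
    // Use -1 (no limit) rather than null, which is deprecated for int parameters since PHP 8.1.
    $sections = preg_split($preg_find, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    $parentNode = $textNode->parentNode;
    foreach ($sections as $section) {
        // Plain text between matches is re-inserted as a text node.
        if (!isset($wrap[$section])) {
            $parentNode->insertBefore($d->createTextNode($section), $textNode);
            continue;
        }
        // A matched word becomes an element with the mapped tag name.
        $tagName = $wrap[$section];
        $parentNode->insertBefore($d->createElement($tagName, $section), $textNode);
    }
    // The original text node has been fully replaced by the nodes inserted before it.
    $parentNode->removeChild($textNode);
}
echo htmlentities($d->saveHTML());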
Edited to replace DOMText with DOMText and DOMElement as necessary.