Regular Expression Negative Lookahead/Lookbehind to Exclude HTML from Find-and-Replace

Question

I have a feature on my site where the search query is highlighted in the search results. However, some of the fields that the site searches through have HTML in them. For example, let's say I had a search result consisting of <span>Hello all</span>. If the user searched for the letter a, I want the code to return <span>Hello <mark>a</mark>ll</span> instead of the messy <sp<mark>a</mark>n>Hello <mark>a</mark>ll</sp<mark>a</mark>n> that it would return now.

I know that I can use negative lookbehinds and lookaheads in preg_replace() to exclude any instances where the a is between a < and >. But how do I do that? Regular expressions are one of my weaknesses and I can't seem to come up with any that work.

So far, what I've got is this:

$return = preg_replace("/(?<!\<[a-z\s]+?)$match(?!\>[a-z\s]+?)/i", '<mark>'.$match.'</mark>', $result);

But it doesn't seem to work. Any help?

Answer 1:

It's considered bad practice to use regex to parse a complex language like HTML. With sufficient skill and patience, and an advanced regex engine, it may be possible, but the potential pitfalls are huge and the performance is unlikely to be good.

A better solution is to use a DOM parser such as PHP's built-in DOMDocument class.

A good example of this can be found here in the answer to this related SO question.
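As a rough illustration of the DOM approach, the sketch below walks only the text nodes of the parsed fragment, so tag names and attributes are never touched. The function name highlightWithDom, the wrapper div with id "wrap", and the assumption that the input is a UTF-8 HTML fragment are all illustrative choices, not part of the original answer.

```php
<?php
// Hypothetical helper (name and wrapper id are assumptions for illustration).
function highlightWithDom(string $html, string $match): string
{
    if ($match === '') {
        return $html; // nothing to highlight
    }

    $doc = new DOMDocument();
    // Suppress warnings that libxml raises for fragments and unknown tags.
    libxml_use_internal_errors(true);
    // Wrap the fragment in a known container so it can be extracted again;
    // the XML encoding hint asks libxml to read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy the text nodes into an array first, because the loop edits the tree.
    $textNodes = iterator_to_array($xpath->query('//div[@id="wrap"]//text()'));
    $pattern = '/(' . preg_quote($match, '/') . ')/i';

    foreach ($textNodes as $node) {
        // With one capture group, odd-indexed parts are the matched text.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no match in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $mark = $doc->createElement('mark');
                $mark->appendChild($doc->createTextNode($part));
                $fragment->appendChild($mark);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper, not the whole document.
    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlightWithDom('<span>Hello all</span>', 'a'), "\n";
```

This is more code than a single preg_replace() call, but it does not depend on the markup being free of stray angle brackets.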

Hope that helps.

Answer 2:

If you do want to use regular expressions, a simple negative lookahead is all that is required (assuming well-formed markup with no < or > within or between the tags):

$return = preg_replace("/$match(?![^<>]*>)/i", '<mark>$0</mark>', $result);

Any special regular expression characters in $match will need to be properly escaped, for example with preg_quote($match, '/').
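Putting the lookahead and the escaping together might look like the following minimal sketch. The function name highlightOutsideTags is hypothetical, and the same assumption as above applies: the markup is well formed, with no < or > inside tags or in the text between them.

```php
<?php
// Hypothetical helper; the name is an assumption, not from the original answer.
function highlightOutsideTags(string $result, string $match): string
{
    if ($match === '') {
        return $result; // an empty pattern would match at every position
    }
    // Escape regex metacharacters and the "/" delimiter in the user's query.
    $quoted = preg_quote($match, '/');
    // (?![^<>]*>) rejects a match when the next angle bracket ahead is ">",
    // which means the match sits inside a tag.
    return preg_replace('/' . $quoted . '(?![^<>]*>)/i', '<mark>$0</mark>', $result);
}

echo highlightOutsideTags('<span>Hello all</span>', 'a'), "\n";
// prints: <span>Hello <mark>a</mark>ll</span>
echo highlightOutsideTags('<b>1+1</b>', '+'), "\n";
// prints: <b>1<mark>+</mark>1</b>
```

Using $0 in the replacement keeps the original casing of the matched text, which a literal $match would not do under the /i flag.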

Source: https://stackoverflow.com/questions/15526781/regular-expression-negative-lookahead-lookbehind-to-exclude-html-from-find-and-r

Tags

php

html

regex

html-parsing

preg-replace