I apologise as I have very little knowledge about Regex and I don't even understand exactly what this regex is doing (I didn't write it - source) apart from the fact it se
George, resurrecting this ancient question because it had a simple solution that wasn't mentioned. This situation is straight out of my pet question of the moment, "Match (or replace) a pattern except in situations s1, s2, s3, etc."
You want to modify the following regex to exclude anything between <script> and </script>:
(\bSOMETERM|SOMETERM\b)(?!([^<]+)?>)
Please forgive me for swapping $term for SOMETERM; it is for clarity, because $ has a special meaning in regex.
With all the disclaimers about matching HTML with regex, to exclude anything between <script> and </script>, you can simply add this to the beginning of your regex:
<script>.*?</script>(*SKIP)(*F)|
so the regex becomes:
<script>.*?</script>(*SKIP)(*F)|(\bSOMETERM|SOMETERM\b)(?!([^<]+)?>)
How does this work?
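The left side of the OR (i.e., |) matches a complete <script>...</script> element and then deliberately fails: (*F) forces the failure, and (*SKIP) tells the engine not to retry from any position inside the text it just consumed. The right side matches what you were matching before, and we know it is the right stuff because anything between script tags has already been skipped.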
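A minimal sketch of the pattern in action with preg_replace. The sample string and the <em> wrapper are made up for illustration, and two details are assumptions added here rather than part of the pattern above: the opening tag is written as <script\b[^>]*> so that tags with attributes also match, and the s modifier lets . cross newlines.

```php
<?php
// Hypothetical sample input: the term appears in plain text, inside an
// attribute, and inside a multi-line script block.
$html = "SOMETERM <a title=\"SOMETERM\">x</a>\n"
      . "<script type=\"text/javascript\">\nvar SOMETERM = 1;\n</script>\n"
      . "SOMETERM";

// Left alternative: consume a whole script element, then (*SKIP)(*F) throws it away.
// Right alternative: the original pattern, unchanged.
$pattern = '~<script\b[^>]*>.*?</script>(*SKIP)(*F)|(\bSOMETERM|SOMETERM\b)(?!([^<]+)?>)~s';

$result = preg_replace($pattern, '<em>$1</em>', $html);
echo $result, "\n";
```

Only the first and last occurrences should end up wrapped; the one in the title attribute is rejected by the lookahead, and the one in the script block is never reached by the right-hand alternative.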
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
You mentioned in a comment that it would be acceptable to remove script tags before performing the search. This code may help with that (the s modifier is needed so that . also matches the newlines inside a multi-line script block):
$data = preg_replace('/<\s*script.*?\/script\s*>/isu', '', $data);
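As a rough sketch of that strip-then-search idea: the sample data and the <em> wrapper below are made up, and the s modifier on the first pattern is an assumption added so that . also matches newlines inside a multi-line script block.

```php
<?php
// Hypothetical input with a multi-line script block.
$data = "Hello bob.\n<script type=\"text/javascript\">\nvar bob = 5;\n</script>\nBye bob.";

// Step 1: drop every script element (s: dot matches newlines, i: case-insensitive).
$clean = preg_replace('/<\s*script.*?\/script\s*>/isu', '', $data);

// Step 2: run the original highlighting pattern on what is left.
$highlighted = preg_replace('/(\bbob|bob\b)(?!([^<]+)?>)/iu', '<em>$1</em>', $clean);

echo $highlighted, "\n";
```

Note that this removes the script from the output entirely, so it only fits when the result is used for display of text (for example a search excerpt) and the script does not need to survive.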
The most accurate approach is to:
<script> tags.
You don't want to try parsing HTML with regular expressions. Here's an explanation of why: http://htmlparsing.com/regexes.html
It will make you sad in the long run. Please take a look at the rest of http://htmlparsing.com/ for some pointers that could get you started.
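For comparison, here is a sketch of what a parser-based version might look like using PHP's bundled DOM extension. The function name highlight_term, the wrapper id, and the sample input are hypothetical; character-encoding handling and malformed markup are deliberately left out, so treat it as a starting point rather than a finished solution.

```php
<?php
// Hypothetical helper: wrap $term in <em> inside text nodes only,
// skipping anything that sits inside <script> or <style>.
function highlight_term(string $html, string $term): string
{
    libxml_use_internal_errors(true);          // keep parser warnings quiet
    $doc = new DOMDocument();
    // Wrap the fragment in a known container so it can be extracted again.
    $doc->loadHTML('<div id="hl-wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');

    $q = preg_quote($term, '/');
    $pattern = '/(\b' . $q . '|' . $q . '\b)/iu';

    foreach (iterator_to_array($nodes) as $node) {
        // With one capture group, odd indexes hold the matched terms.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;                          // term not present in this text node
        }
        $frag = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $em = $doc->createElement('em');
                $em->appendChild($doc->createTextNode($part));
                $frag->appendChild($em);
            } elseif ($part !== '') {
                $frag->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($frag, $node);
    }

    // Serialize only the children of the wrapper, not the whole document.
    $wrap = $xpath->query('//div[@id="hl-wrap"]')->item(0);
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$out = highlight_term(
    'My name is bob! <script type="text/javascript">var bob = 5;</script> Yeah, bob.',
    'bob'
);
echo $out, "\n";
```

Because the term is only ever searched for inside text nodes, attributes and script bodies are never touched, which is the property the regex lookahead tries to approximate.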
Since lookbehind assertions need to be fixed in length, you cannot use them to look for a preceding <script> tag somewhere before the searched term.
So, after you replace all occurrences of the desired term, you need a second pass that reverts the replacements that ended up inside a <script> tag.
# provide some sample data
$excerpt = 'My name is bob!
And bob is cool.
<script type="text/javascript">
var bobby = "It works fine even if you already have tagged the term <em>bob</em> inside the script tag.";
alert(bobby);
var bob = 5;
</script>
Yeah, the word "bob" works fine.';
$start_emp_token = '<em>';
$end_emp_token = '</em>';
$pr_term = 'bob';
# replace everything (not in a tag)
$excerpt = preg_replace("/(\b$pr_term|$pr_term\b)(?!([^<]+)?>)/iu", $start_emp_token . '$1' . $end_emp_token, $excerpt);
# undo some of the replacements
# (uses a closure: create_function() was deprecated in PHP 7.2 and removed in PHP 8.0)
$excerpt = preg_replace_callback('#(<script(?:[^>]*)>)(.*?)(</script>)#is',
function ($matches) use ($start_emp_token, $end_emp_token, $pr_term) {
return $matches[1].str_replace("$start_emp_token$pr_term$end_emp_token", "$pr_term", $matches[2]).$matches[3];
},
$excerpt);
var_dump($excerpt);
The code above produces the following output:
string(271) "My name is <em>bob</em>!
And <em>bob</em> is cool.
<script type="text/javascript">
var bobby = "It works fine even if you already have tagged the term <em>bob</em> inside the script tag.";
alert(bobby);
var bob = 5;
</script>
Yeah, the word "<em>bob</em>" works fine."