Regex ignore matches between <script> tags


I apologise as I have very little knowledge about Regex and I don't even understand exactly what this regex is doing (I didn't write it - source) apart from the fact it se


4 Answers

故里飘歌

2020-12-22 12:12
George, resurrecting this ancient question because it had a simple solution that wasn't mentioned. This situation is straight out of my pet question of the moment, Match (or replace) a pattern except in situations s1, s2, s3 etc

You want to modify the following regex to exclude anything between <script> and </script>:
```
(\bSOMETERM|SOMETERM\b)(?!([^<]+)?>)
```
Please forgive me for switching out $term with SOMETERM; it is for clarity, because $ has a special meaning in regex.

With all the disclaimers about matching html in regex, to exclude anything between <script> and </script>, you can simply add this to the beginning of your regex:
```
<script>.*?</script>(*SKIP)(*F)|
```
so the regex becomes:
```
<script>.*?</script>(*SKIP)(*F)|(\bSOMETERM|SOMETERM\b)(?!([^<]+)?>)
```
How does this work?

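The left side of the OR (i.e., |) matches a complete <script>...</script> element, then deliberately fails; (*SKIP) also stops the engine from retrying at any position inside the text it just consumed. The right side matches what you were matching before, and we know it is the right stuff because if it had been between script tags, that stretch would already have been skipped. Note that, as written, the left side only matches a bare <script> tag (no attributes), and . does not cross line breaks unless the s modifier is set.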
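As a rough illustration of the explanation above, here is a minimal PHP sketch. The sample text and the term `bob` are made up for the demo, and the left branch is loosened to `<script\b[^>]*>` with the `s` modifier so that tags with attributes and multi-line scripts are also skipped; that loosening is my assumption, not part of the pattern in the answer.

```php
<?php
// Sketch only: $html and the term "bob" are illustrative sample values.
$html = "bob says hi.\n"
      . "<script type=\"text/javascript\">\nvar bob = 5;\n</script>\n"
      . "bye, bob.";
$term = preg_quote('bob', '~');

// Left branch: consume a whole <script ...>...</script> element, then
// (*SKIP)(*F) throws the match away and resumes scanning after it.
// Right branch: the original pattern, unchanged.
$pattern = '~<script\b[^>]*>.*?</script>(*SKIP)(*F)|'
         . '(\b' . $term . '|' . $term . '\b)(?!([^<]+)?>)~isu';

$result = preg_replace($pattern, '<em>$1</em>', $html);
echo $result, "\n";
```

With this input, the two occurrences of `bob` outside the script element are wrapped in `<em>`, while `var bob = 5;` is left alone. Without the left branch, the original pattern would also wrap the `bob` inside the script.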

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

青春惊慌失措

2020-12-22 12:15
You mentioned in a comment that it would be acceptable to remove script tags before performing the search.
```
// The s modifier lets . match newlines, so multi-line scripts are removed too.
$data = preg_replace('/<\s*script.*?\/script\s*>/isu', '', $data);
```
This code may help with that.

走了就别回头了

2020-12-22 12:16
The most accurate approach is to:
- Parse the HTML with a proper HTML parser
- Ignore the strings that are within the <script> tags.
You don't want to try parsing HTML with regular expressions. Here's an explanation of why: http://htmlparsing.com/regexes.html

It will make you sad in the long run. Please take a look at the rest of http://htmlparsing.com/ for some pointers that could get you started.
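For what it's worth, the two steps listed above could look roughly like this with PHP's bundled DOM extension. This is only a sketch: the sample markup, the term `bob`, and the wrapping `<div>` (added so the fragment has a single root) are my assumptions, and it presumes ext-dom is enabled.

```php
<?php
// Sketch only: sample markup and the term "bob" are illustrative assumptions.
$html = '<p>My name is bob!</p><script>var bob = 5;</script><p>And bob is cool.</p>';
$quoted = preg_quote('bob', '/');

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
// The flags stop libxml from adding <html>/<body> wrappers and a doctype.
$doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// Select every text node that is NOT inside a <script> element.
$xpath = new DOMXPath($doc);
$textNodes = iterator_to_array($xpath->query('//text()[not(ancestor::script)]'));

foreach ($textNodes as $node) {
    // Split on the term and keep the matched term as its own array entry.
    $parts = preg_split('/(\b' . $quoted . '|' . $quoted . '\b)/iu',
                        $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 2) {
        continue; // term not present in this text node
    }
    $fragment = $doc->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Odd indexes hold the captured term: wrap it in <em>.
            $em = $doc->createElement('em');
            $em->appendChild($doc->createTextNode($part));
            $fragment->appendChild($em);
        } elseif ($part !== '') {
            $fragment->appendChild($doc->createTextNode($part));
        }
    }
    $node->parentNode->replaceChild($fragment, $node);
}

$out = $doc->saveHTML();
echo $out, "\n";
```

Because the script contents are never selected by the XPath query, there is nothing to undo afterwards, and attribute values are untouched as well since only text nodes are rewritten.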

忘了有多久

2020-12-22 12:21

Since lookbehind assertions in PCRE need to be fixed in length, you cannot use them to look for a preceding <script> tag somewhere before the searched term.

So, after you replace all the occurrences of the desired term, you need a second pass that reverts the replacements that ended up inside a <script> tag.

```
# provide some sample data
$excerpt = 'My name is bob!

And bob is cool.

<script type="text/javascript">
var bobby = "It works fine even if you already have tagged the term <em>bob</em> inside the script tag.";
alert(bobby);

var bob = 5;
</script>

Yeah, the word "bob" works fine.';

$start_emp_token = '<em>';
$end_emp_token = '</em>';
$pr_term = 'bob';

# replace everything (not in a tag)
$excerpt = preg_replace("/(\b$pr_term|$pr_term\b)(?!([^<]+)?>)/iu", $start_emp_token . '$1' . $end_emp_token, $excerpt);

# undo the replacements that landed inside <script> elements
# (a closure replaces create_function(), which was removed in PHP 8.0)
$excerpt = preg_replace_callback('#(<script(?:[^>]*)>)(.*?)(</script>)#is',
    function ($matches) use ($start_emp_token, $end_emp_token, $pr_term) {
        return $matches[1]
            . str_replace($start_emp_token . $pr_term . $end_emp_token, $pr_term, $matches[2])
            . $matches[3];
    },
    $excerpt);

var_dump($excerpt);
```

The code above produces the following output:

```
string(271) "My name is <em>bob</em>!

And <em>bob</em> is cool.

<script type="text/javascript">
var bobby = "It works fine even if you already have tagged the term <em>bob</em> inside the script tag.";
alert(bobby);

var bob = 5;
</script>

Yeah, the word "<em>bob</em>" works fine."
```
