Regex ignore matches between <script> tags

Question

I apologise, as I have very little knowledge of regex and don't understand exactly what this one does (I didn't write it - source), apart from the fact that it searches for a certain term so that it can be highlighted.

Here is the Regex:

/(\b$term|$term\b)(?!([^<]+)?>)/iu

The problem is that I need to make sure it doesn't match anything between <script> and </script> tags. I know there are many variations of how a script tag can be written, but really all I need is for it to ignore any text between <script and /script>, taking into account possible whitespace, as in < script or /script >.

Is anyone able to modify it in this way? I will notify the plugin's author, who wrote this regex, so it can be included in future releases.

Edit: Here is the function it originates from:

function relevanssi_highlight_terms($excerpt, $query) {
    $type = get_option("relevanssi_highlight");
    if ("none" == $type) {
        return $excerpt;
    }

    switch ($type) {
        case "mark":                        // thanks to Jeff Byrnes
            $start_emp = "<mark>";
            $end_emp = "</mark>";
            break;
        case "strong":
            $start_emp = "<strong>";
            $end_emp = "</strong>";
            break;
        case "em":
            $start_emp = "<em>";
            $end_emp = "</em>";
            break;
        case "col":
            $col = get_option("relevanssi_txt_col");
            if (!$col) $col = "#ff0000";
            $start_emp = "<span style='color: $col'>";
            $end_emp = "</span>";
            break;
        case "bgcol":
            $col = get_option("relevanssi_bg_col");
            if (!$col) $col = "#ff0000";
            $start_emp = "<span style='background-color: $col'>";
            $end_emp = "</span>";
            break;
        case "css":
            $css = get_option("relevanssi_css");
            if (!$css) $css = "color: #ff0000";
            $start_emp = "<span style='$css'>";
            $end_emp = "</span>";
            break;
        case "class":
            $css = get_option("relevanssi_class");
            if (!$css) $css = "relevanssi-query-term";
            $start_emp = "<span class='$css'>";
            $end_emp = "</span>";
            break;
        default:
            return $excerpt;
    }

    $start_emp_token = "*[/";
    $end_emp_token = "\]*";

    if ( function_exists('mb_internal_encoding') )
        mb_internal_encoding("UTF-8");

    $terms = array_keys(relevanssi_tokenize($query, $remove_stopwords = true));

    $phrases = relevanssi_extract_phrases(stripslashes($query));

    $non_phrase_terms = array();
    foreach ($phrases as $phrase) {
        $phrase_terms = array_keys(relevanssi_tokenize($phrase, false));
        foreach ($terms as $term) {
            if (!in_array($term, $phrase_terms)) {
                $non_phrase_terms[] = $term;
            }
        }
        $terms = $non_phrase_terms;
        $terms[] = $phrase;
    }

    usort($terms, 'relevanssi_strlen_sort');

    get_option('relevanssi_word_boundaries', 'on') == 'on' ? $word_boundaries = true : $word_boundaries = false;
    foreach ($terms as $term) {
        $pr_term = preg_quote($term, '/');
        if ($word_boundaries) {
            $excerpt = preg_replace("/(\b$pr_term|$pr_term\b)(?!([^<]+)?>)/iu", $start_emp_token . '\\1' . $end_emp_token, $excerpt);
        }
        else {
            $excerpt = preg_replace("/($pr_term)(?!([^<]+)?>)/iu", $start_emp_token . '\\1' . $end_emp_token, $excerpt);
        }
        // thanks to http://pureform.wordpress.com/2008/01/04/matching-a-word-characters-outside-of-html-tags/
    }

    $excerpt = relevanssi_remove_nested_highlights($excerpt, $start_emp_token, $end_emp_token);

    $excerpt = str_replace($start_emp_token, $start_emp, $excerpt);
    $excerpt = str_replace($end_emp_token, $end_emp, $excerpt);
    $excerpt = str_replace($end_emp . $start_emp, "", $excerpt);
    if (function_exists('mb_ereg_replace')) {
        $pattern = $end_emp . '\s*' . $start_emp;
        $excerpt = mb_ereg_replace($pattern, " ", $excerpt);
    }

    return $excerpt;
}

Answer 1:

Since a lookbehind assertion in PCRE (the engine behind PHP's preg_* functions) cannot contain an unbounded pattern such as .*, you cannot use one to look for a <script> tag somewhere before the searched term.

So, after you replace all the occurrences of the desired term, you need a second pass that reverts the replacements that ended up inside a <script> element.

# provide some sample data
$excerpt = 'My name is bob!

And bob is cool.

<script type="text/javascript">
var bobby = "It works fine even if you already have tagged the term <em>bob</em> inside the script tag.";
alert(bobby);

var bob = 5;
</script>

Yeah, the word "bob" works fine.';

$start_emp_token = '<em>';
$end_emp_token = '</em>';
$pr_term = 'bob';

# replace everything (not in a tag)
$excerpt = preg_replace("/(\b$pr_term|$pr_term\b)(?!([^<]+)?>)/iu", $start_emp_token . '$1' . $end_emp_token, $excerpt);

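# undo the replacements that landed inside <script> elements
# (create_function() was removed in PHP 8; a closure also avoids the globals)
$excerpt = preg_replace_callback('#(<script(?:[^>]*)>)(.*?)(</script>)#is',
                       function ($matches) use ($start_emp_token, $end_emp_token, $pr_term) {
                           # str_replace() is case-sensitive, so only exact-case matches of $pr_term are reverted
                           return $matches[1]
                                . str_replace($start_emp_token . $pr_term . $end_emp_token, $pr_term, $matches[2])
                                . $matches[3];
                       },
                       $excerpt);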

var_dump($excerpt);

The code above produces the following output:

string(271) "My name is <em>bob</em>!

And <em>bob</em> is cool.

<script type="text/javascript">
var bobby = "It works fine even if you already have tagged the term <em>bob</em> inside the script tag.";
alert(bobby);

var bob = 5;
</script>

Yeah, the word "<em>bob</em>" works fine."

Answer 2:

The most accurate approach is to:

1. Parse the HTML with a proper HTML parser.
2. Ignore the strings that are within the <script> tags.

You don't want to try parsing HTML with regular expressions. Here's an explanation of why: http://htmlparsing.com/regexes.html

It will make you sad in the long run. Please take a look at the rest of http://htmlparsing.com/ for some pointers that could get you started.
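As a rough illustration of the parser-based approach, here is a minimal sketch using PHP's bundled DOM extension. The function name highlight_text_nodes, the wrapper id __hl_wrap, and the choice of <em> as the highlight tag are all assumptions made for this example; none of them come from the plugin. It walks only the text nodes that have no <script> or <style> ancestor and wraps matches in an element, so script contents are never touched.

```php
<?php
// Hypothetical helper (not part of Relevanssi): highlight $term in text
// nodes only, skipping anything inside <script> or <style>.
function highlight_text_nodes(string $html, string $term): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy markup quietly
    // The XML prolog is a common workaround to make loadHTML() read UTF-8;
    // the wrapper div lets us find the fragment again afterwards.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="__hl_wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $q = preg_quote($term, '/');
    $pattern = '/(\b' . $q . '|' . $q . '\b)/iu';   // same word-boundary idea as the question

    // Snapshot the text nodes first, because the loop modifies the tree.
    $nodes = iterator_to_array(
        $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]')
    );
    foreach ($nodes as $node) {
        // With one capture group, odd indexes hold the matched terms.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;   // no match in this text node
        }
        $frag = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $em = $doc->createElement('em');
                $em->appendChild($doc->createTextNode($part));
                $frag->appendChild($em);
            } elseif ($part !== '') {
                $frag->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($frag, $node);
    }

    // Serialise only the children of the wrapper, not the whole document.
    $wrap = $xpath->query('//div[@id="__hl_wrap"]')->item(0);
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlight_text_nodes('<p>bob is here</p><script>var bob = 5;</script>', 'bob');
```

The serialised markup may differ slightly from the input (for example in whitespace or entity encoding), which is worth checking before using something like this on excerpts.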

Answer 3:

You mentioned in a comment that it would be acceptable to remove script tags before performing the search.

// the s modifier lets . match newlines, so multi-line scripts are removed too
$data = preg_replace('/<\s*script.*?\/script\s*>/isu', '', $data);

This code may help with that.
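A small sketch of how stripping might be combined with the highlighting step, assuming the sample text and the hard-coded term bob below (both made up for illustration). Note that . only crosses line breaks under the s modifier, so the sketch uses it to handle scripts that span several lines.

```php
<?php
// Sample input with a multi-line script block (illustrative data only).
$data = "Hello bob.\n<script type=\"text/javascript\">\nvar bob = 5;\n</script>\nBye bob.";

// Step 1: drop every script block, including ones spanning several lines.
$stripped = preg_replace('/<\s*script.*?\/script\s*>/isu', '', $data);

// Step 2: run the highlighting regex from the question on what is left.
$highlighted = preg_replace('/(\bbob|bob\b)(?!([^<]+)?>)/iu', '<em>$1</em>', $stripped);

echo $highlighted;
```

This only suits cases where losing the scripts from the excerpt is acceptable, as the comment mentioned above suggests.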

Answer 4:

George, resurrecting this ancient question because it had a simple solution that wasn't mentioned. This situation is straight out of my pet question of the moment, Match (or replace) a pattern except in situations s1, s2, s3 etc

You want to modify the following regex to exclude anything between <script> and </script>:

(\bSOMETERM|SOMETERM\b)(?!([^<]+)?>)

Please forgive me for switching out $term with SOMETERM, it is for clarity because $ has a special meaning in regex.

With all the disclaimers about matching html in regex, to exclude anything between <script> and </script>, you can simply add this to the beginning of your regex:

<script>.*?</script>(*SKIP)(*F)|

so the regex becomes:

<script>.*?</script>(*SKIP)(*F)|(\bSOMETERM|SOMETERM\b)(?!([^<]+)?>)

How does this work?

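The left side of the OR (i.e., |) matches a complete <script>...</script> block, then deliberately fails; (*SKIP) stops the engine from retrying at positions inside that block. The right side matches what you were matching before, and we know it is the right stuff because if it was between script tags, it would have failed. Note that, as written, the left side only matches a bare <script> opening tag with no attributes, and . only crosses line breaks under the s (DOTALL) modifier.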
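Here is a minimal PHP sketch of the same (*SKIP)(*F) idea. The function name highlight_outside_scripts and the default <em> markers are assumptions for illustration. The left alternative is loosened beyond the pattern above so that it also tolerates attributes and whitespace in the tags, and the s modifier is added so that scripts spanning several lines are skipped.

```php
<?php
// Hypothetical helper (not part of Relevanssi): highlight $term everywhere
// except inside <script>...</script> blocks.
function highlight_outside_scripts(
    string $excerpt,
    string $term,
    string $open = '<em>',
    string $close = '</em>'
): string {
    $pr_term = preg_quote($term, '/');

    // Left alternative: consume a whole script block, then force a failure
    // and skip past it. Right alternative: the regex from the question.
    $pattern = '/<\s*script\b[^>]*>.*?<\s*\/\s*script\s*>(*SKIP)(*F)'
             . '|(\b' . $pr_term . '|' . $pr_term . '\b)(?!([^<]+)?>)/isu';

    return preg_replace($pattern, $open . '$1' . $close, $excerpt);
}

$html = "bob says hi.\n<script type=\"text/javascript\">\nvar bob = 5;\n</script>\nBye, Bob.";
echo highlight_outside_scripts($html, 'bob');
```

With this sample, the two occurrences outside the script block are wrapped and var bob = 5; is left alone.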

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

Source: https://stackoverflow.com/questions/12532744/regex-ignore-matches-between-script-tags

Tags

php