Highlight a word of text on the page using .replace()

问题

I'm developing a Google Chrome extension that allows you to automatically apply a highlighting CSS rule to a word that you choose.

I have the following code

var elements = document.getElementsByTagName('*');

for (var i=0; i<elements.length; i++) {
    var element = elements[i];

    for (var j=0; j<element.childNodes.length; j++) {
        var node = element.childNodes[j];

        if(node.nodeType === 3) {
            var text = node.nodeValue;

            var fetchedText = text.match(/teste/gi);

            if(fetchedText) {
                var replacedText = element.innerHTML.replace(/(teste)/gi, "<span style=\"background-color: yellow\">$1</span>");

                if (replacedText !== text) {
                    element.innerHTML = replacedText;
                }
            }
        }
    }
}

Which breaks and freezes my Chrome tab. However, if I switch from element.innerHTML = replacedText; to element.innerHTML = "text"; this works.

I can't seem to find what's wrong with the following code.

回答1:

You are first testing the #text nodes to see if the text contains the word you are trying to highlight, but then performing the replacement on the .innerHTML of the parent element. There are a couple of problems with this.

Infinite replacements: When you modify the .innerHTML of the parent element you change the childNodes array. You do so in a way that has added a node further in the array containing the text which is to be replaced. Thus, when you continue scanning the childNodes array you always find a (new) node that contains the text you want to replace. So, you replace it again, creating another node that has a higher index in the childNodes array. Repeat infinitely.
Using a RegExp to replace text in the .innerHTML property. While you have already tested to make sure the text you desire to replace is actually contained in a text node, this does not prevent your RegExp from also replacing any matching words within the actual HTML of the element (e.g. in src="yourWord", href="http://foo.com/yourWord/bar.html", or if attempting to highlight words like style, color, background, span, id, height, width, button, form, input, etc.).
You are not checking to make sure you are not changing text in <script> or <style> tags.
You are checking that you only make changed in text nodes (i.e. you check for node.nodeType === 3). If you weren't checking for this you would also have the following possible problems due to using .innerHTML to change HTML:
- You could end up changing attributes, or actual HTML tags, depending on what you are changing with .replace(). This could completely disrupt the page layout and functionality.
- When you change .innerHTML the DOM for that portion of the page is completely recreated. This means the elements, while new elements might be the same type with the same attributes, any event listeners which were attached to the old elements will not be attached to the new elements. This can significantly disrupt the functionality of a page.
- Repeatedly changing large portions of the DOM can be quite compute intensive to re-render the page. Depending on how you do this, you may run into significant user-perceived performance issues.

Thus, if you are going to use a RegExp to replace the text, you need to perform the operation only on the contents of the #text node, not on the .innerHTML of the parent node. Because you are wanting to create additional HTML elements (e.g. new <span style=""> elements, with child #text nodes), there are some complications.

Can not assign HTML text to a text node to create new HTML nodes:

There is no way to assign new HTML directly to a text node and have it evaluated as HTML, creating new nodes. Assigning to a text node's .innerHTML property will create such a property on the Object (just like it would on any Object), but will not change the text displayed on the screen (i.e. the actual value of the #text node). Thus, it will not accomplish what you are wanting to do: it will not create any new HTML children of the parent node.

The way to do this that has the least impact on the page's DOM (i.e. least likely to break existing JavaScript on the page) is to create a <span> to include the new text nodes you are creating (the text that was in the #text node that is not in your colored <span>) along with the potentially multiple <span> elements you are creating. This will result in replacing a single #text node with a single <span> element. While this will create additional descendants, it will leave the number of children in the parent element unchanged. Thus, any JavaScript which was relying on that will not be affected. Given that we are changing the DOM, there is no way to not potentially break other JavaScript, but this should minimize that possibility.

Some examples of how you can do this: See this answer (replaces a list of words with those words in buttons) and this answer (places all text in <p> elements which is separated by spaces into buttons) for full extensions that perform regex replace with new HTML. See this answer which does basically the same thing, but makes a link (it has a different implementation which traverses the DOM with a TreeWalker to find #text nodes instead of a NodeIterator as used in the other two examples).

Here is code which will perform the replacement which you are desiring on each text node in the document.body and create the new HTML needed to have the style be different in a portion of the text:

function handleTextNode(textNode) {
    if(textNode.nodeName !== '#text'
        || textNode.parentNode.nodeName === 'SCRIPT' 
        || textNode.parentNode.nodeName === 'STYLE'
    ) {
        //Don't do anything except on text nodes, which are not children 
        //  of <script> or <style>.
        return;
    }
    let origText = textNode.textContent;
    let newHtml=origText.replace(/(teste)/gi
                                 ,'<span style="background-color: yellow">$1</span>');
    //Only change the DOM if we actually made a replacement in the text.
    //Compare the strings, as it should be faster than a second RegExp operation and
    //  lets us use the RegExp in only one place for maintainability.
    if( newHtml !== origText) {
        let newSpan = document.createElement('span');
        newSpan.innerHTML = newHtml;
        textNode.parentNode.replaceChild(newSpan,textNode);
    }
}

let textNodes = [];
//Create a NodeIterator to get the text nodes in the body of the document
let nodeIter = document.createNodeIterator(document.body,NodeFilter.SHOW_TEXT);
let currentNode;
//Add the text nodes found to the list of text nodes to process.
while(currentNode = nodeIter.nextNode()) {
    textNodes.push(currentNode);
}
//Process each text node
textNodes.forEach(function(el){
    handleTextNode(el);
});

There are other ways to do this. However, they will generate more significant changes to the structure of the children for that specific element (e.g. multiple additional nodes on the parent). Doing so has a higher potential of breaking any JavaScript already on the page which is relying on the current structure of the page. Actually, any change like this has the potential to break current JavaScript.

^{The code for in this answer was modified from the code in this other answer of mine}

回答2:

The error I was experiencing was due to a recursive loop because, for instance, I was looking for the keyword teste and I was inserting a new element with the content <span style=\"background-color: #ffff00\">teste</span> which would force the script to try to replace the new keyword teste again and so on.

I came up with this function:

function applyReplacementRule(node) {
    // Ignore any node whose tag is banned
    if (!node || $.inArray(node.tagName, hwBannedTags) !== -1) { return; }

    try {
        $(node).contents().each(function (i, v) {
            // Ignore any child node that has been replaced already or doesn't contain text
            if (v.isReplaced || v.nodeType !== Node.TEXT_NODE) { return; }

            // Apply each replacement in order
            hwReplacements.then(function (replacements) {
                replacements.words.forEach(function (replacement) {
                    //if( !replacement.active ) return;
                    var matchedText = v.textContent.match(new RegExp(replacement, "i"));

                    if (matchedText) {
                        // Use `` instead of '' or "" if you want to use ${variable} inside a string
                        // For more information visit https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals
                        var replacedText = node.innerHTML.replace(new RegExp(`(${replacement})`, "i"), "<span style=\"background-color: #ffff00\">$1</span>");

                        node.innerHTML = replacedText;
                    }
                });
            }).catch(function (reason) {
                console.log("Handle rejected promise (" + reason + ") here.");
            });

            v.isReplaced = true;
        });
    } catch (err) {
        // Basically this means that an iframe had a cross-domain source
        if (err.name !== "SecurityError")
        { throw err; }
    }
}

Where I modify the node property and "tell" that I've already modify that node so I don't end up on a recursive infinite loop again.

P.S. As you can see this solution uses jQuery. I'll try to rewrite this to use just Vanilla JS.

来源：https://stackoverflow.com/questions/40710728/highlight-a-word-of-text-on-the-page-using-replace

标签

javascript

google-chrome-extension

textnode