Regex Remove Images with style tag from Html

遥遥无期 2020-12-11 08:57

I am new to Regex, however I decided it was the easiest route to what I needed to do. Basically I have a string (in PHP) which contains a whole load of HTML code... I want

5 answers
  • 2020-12-11 09:04

    As Michael pointed out, you don't want to use Regex for this purpose. A Regex does not know what an element tag is. <foo> is as meaningful as >foo< unless you teach it the difference, and teaching it the difference is incredibly tedious.

    DOM is so much more convenient:

    $html = <<< HTML
    <img src="" style="display:none" />
    <IMG src="" style="width:11px;display: none" >
    <img src="" style="width:11px" >
    HTML;
    

    The above is our (invalid) markup. We feed it to DOM like this:

    $dom = new DOMDocument();
    $dom->loadHTML($html); // the HTML parser lowercases tag names, so <IMG> becomes <img>
    $dom->normalizeDocument();
    

    Now we query the DOM for all "IMG" elements that have a "style" attribute containing the text "display". We could query for "display: none" directly in the XPath, but our input markup has occurrences with no space in between:

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//img[contains(@style, "display")]') as $node) {
        // Strip all whitespace so "display: none" and "display:none" compare equal
        $style = preg_replace('/\s+/', '', $node->getAttribute('style'));
        if (strpos($style, 'display:none') !== false) {
            $node->parentNode->removeChild($node);
        }
    }
    

    We iterate over the IMG nodes and remove all whitespace from their style attribute content. Then we check if it contains "display:none" and if so, remove the element from the DOM.

    Now we only need to save our HTML:

    echo $dom->saveHTML();
    

    gives us:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><img src="" style="width:11px"></body></html>
    

    Screw Regex!
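
    If the DOCTYPE and html/body wrapper in that output is unwanted, the same steps can be folded into one helper that returns only the contents of the body. This is a minimal sketch, not part of the original answer: the function name removeHiddenImages is hypothetical, it selects on //img[@style] and lowercases the style value so that DISPLAY:NONE is caught too, and it assumes PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper: drop every <img> whose inline style hides it.
function removeHiddenImages(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Copy the result into an array before removing nodes from the tree.
    foreach (iterator_to_array($xpath->query('//img[@style]')) as $node) {
        // Normalize the style value: no whitespace, all lowercase.
        $style = strtolower(preg_replace('/\s+/', '', $node->getAttribute('style')));
        if (strpos($style, 'display:none') !== false) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize only the children of <body>, skipping the wrapper markup.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

// Same three images as in the markup above, on one line.
$html = '<img src="" style="display:none" />'
      . '<IMG src="" style="width:11px;display: none" >'
      . '<img src="" style="width:11px" >';

echo removeHiddenImages($html), "\n";
```

    Only the third image should survive. Note that loadHTML() assumes ISO-8859-1 unless the markup declares another charset, so UTF-8 input may need a charset declaration first.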


    Addendum: you might also be interested in Parsing XML documents with CSS selectors

  • 2020-12-11 09:04
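    // Case-insensitive, and requires display:none inside the tag rather than any stray "none"
    $html = preg_replace('/<img\b[^>]*\bstyle\s*=[^>]*\bdisplay\s*:\s*none\b[^>]*>/i', '', $html);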
    
  • 2020-12-11 09:06

    Because <img> doesn't allow any other elements inside it, this is possible; but in general, regexp is a thoroughly bad tool for parsing a recursively defined language like HTML.

    Anyway, the problem you're probably hitting is that the closing > is being matched by one of the .* expressions, and there happens to be a later > on the line that matches your explicit >.

    If you replace all your .* with [^>]*, that will prevent it. (They probably don't all need to be replaced, but you might as well.)
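
    To make the difference concrete, here is a small sketch. The asker's actual pattern is not shown in this thread, so the greedy pattern below is a hypothetical stand-in for it, and the sample markup is made up for the demonstration.

```php
<?php
// Made-up sample: one hidden image, a paragraph, and one visible image.
$html = '<img src="a.png" style="display:none"><p>kept</p><img src="b.png">';

// Hypothetical greedy pattern: the last .* runs on to the final ">" in the
// string, so everything after the first "<img" is swallowed.
$greedy = preg_replace('/<img.*style.*none.*>/', '', $html);

// Same pattern with [^>]* in place of .*: the match cannot cross the ">"
// that closes the first tag.
$scoped = preg_replace('/<img[^>]*style[^>]*none[^>]*>/', '', $html);

var_dump($greedy); // string(0) ""
var_dump($scoped); // string(28) "<p>kept</p><img src="b.png">"
```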

  • 2020-12-11 09:08

    Here is another version that works with any tag, with or without a space after the colon in the inline style (display:none or display: none). It also deletes the content inside the tag, as long as that content holds no nested tags.

    $html = preg_replace('/<[^>]+style[^>]+display:\s*none[^>]+>.*?>/', '', $html);
    

    I tested it with the following input, and it works fine:

    Only show<div style='display:none'>Delete inside content as well</div> this text.
    
    Only show<span style='display: none'>Delete inside content as well</span> this text.
    
    Only show<div style="display: none">Delete inside content as well</div> this text.
    
    Only show<span style="display:none;">Delete inside content as well</span> this text.
    

    Each line should now output only:

    Only show this text.
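
    The test lines above contain no nested tags. Because .*?> stops at the first > after the opening tag, a hidden element with markup inside it would leave fragments behind. A DOM-based sketch of the same idea avoids that; the function name removeHiddenElements and the sample input are hypothetical, and the input is wrapped in a <div> so the parser does not have to guess a container for the loose text.

```php
<?php
// Hypothetical helper: remove any element hidden with an inline
// display:none, together with everything nested inside it.
function removeHiddenElements(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Copy the result into an array before removing nodes from the tree.
    foreach (iterator_to_array($xpath->query('//*[@style]')) as $node) {
        // Normalize the style value: no whitespace, all lowercase.
        $style = strtolower(preg_replace('/\s+/', '', $node->getAttribute('style')));
        if (strpos($style, 'display:none') !== false && $node->parentNode !== null) {
            // Removing the node also removes its whole subtree.
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize only the children of <body>, skipping the wrapper markup.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

$html = '<div>Only show<span style="display: none">Delete <b>nested</b> content</span> this text.</div>';

echo removeHiddenElements($html), "\n";
```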
    
  • 2020-12-11 09:14

    Your regular expression is way too broad; .* means "match anything", so this would match:

    <img src="foo.png" style="something">Some random displayed text : foo none; bar<br>
    

    At the very least, you probably want to exclude closing brackets from your matches, so use [^>]* instead of .*. You also might want to read this, though, and look into using something that actually understands HTML, like DOMDocument.
