I am new to Regex, however I decided it was the easiest route to what I needed to do. Basically I have a string (in PHP) which contains a whole load of HTML code... I want
As Michael pointed out, you don't want to use Regex for this purpose. A Regex does not know what an element tag is: <foo> is as meaningful as >foo< unless you teach it the difference. Teaching it the difference is incredibly tedious, though.
DOM is so much more convenient:
$html = <<< HTML
<img src="" style="display:none" />
<IMG src="" style="width:11px;display: none" >
<img src="" style="width:11px" >
HTML;
The above is our (invalid) markup. We feed it to DOM like this:
$dom = new DOMDocument();
$dom->loadHTML($html);
$dom->normalizeDocument();
Now we query the DOM for all "IMG" elements with a "style" attribute that contains the text "display". We could query for "display: none" directly in the XPath, but our input markup also has occurrences with no space in between:
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//img[contains(@style, "display")]') as $node) {
    // Strip spaces so "display: none" and "display:none" compare equal.
    $style = str_replace(' ', '', $node->getAttribute('style'));
    if (strpos($style, 'display:none') !== false) {
        $node->parentNode->removeChild($node);
    }
}
We iterate over the IMG nodes and remove all whitespace from their style attribute content. Then we check if it contains "display:none" and if so, remove the element from the DOM.
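As a variation on the loop above, the whitespace stripping can probably be pushed into the XPath expression itself with translate(), which deletes characters that have no counterpart in its third argument. This is only a sketch reusing the sample markup from above; the $removed and $remaining counters are added here for illustration, and it only strips plain spaces, not tabs or newlines.

```php
<?php
// Same sample markup as in the answer above.
$html = <<<HTML
<img src="" style="display:none" />
<IMG src="" style="width:11px;display: none" >
<img src="" style="width:11px" >
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// translate(@style, " ", "") removes every space before contains() is evaluated.
$query = '//img[contains(translate(@style, " ", ""), "display:none")]';

$removed = 0;
foreach ($xpath->query($query) as $node) {
    $node->parentNode->removeChild($node);
    $removed++;
}

// Count the img elements that survived the removal.
$remaining = $dom->getElementsByTagName('img')->length;
echo "removed: $removed, remaining: $remaining\n";
```

The PHP-side loop keeps the matching logic easier to extend (for example to "visibility:hidden"), so which form is nicer is mostly a matter of taste.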
Now we only need to save our HTML:
echo $dom->saveHTML();
gives us:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><img src="" style="width:11px"></body></html>
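If the DOCTYPE and the html/body wrapper are unwanted because the input was only a fragment, one possible approach is to serialize just the children of the body element. This is a sketch under assumptions: the input string and the $fragment variable are made up for illustration, and it relies on saveHTML() accepting a node argument (PHP 5.3.6 or later).

```php
<?php
$dom = new DOMDocument();
// Hypothetical fragment, standing in for the cleaned-up markup.
$dom->loadHTML('<p>kept</p><img src="" style="width:11px">');

// loadHTML() wraps the fragment in <html><body>; serialize only what is inside <body>.
$body = $dom->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $dom->saveHTML($child);
}

echo $fragment, "\n";
```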
Screw Regex!
Addendum: you might also be interested in Parsing XML documents with CSS selectors
// The i flag makes the match case-insensitive, so <IMG ...> is caught as well.
$html = preg_replace("/<img[^>]+style[^>]+none[^>]+>/i", '', $html);
Because <img> doesn't allow any other elements inside it, this is possible; but in general, regexp is a thoroughly bad tool for parsing a recursively defined language like HTML.
Anyway, the problem you're probably hitting is that the closing > is being matched by one of the .* expressions, and there happens to be a later > on the line to match your explicit > .
If you replace all your .* with [^>]*, that will prevent it. (They probably don't all need to be replaced, but you might as well.)
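To make the difference concrete, here is a small sketch. The input line and both patterns are hypothetical, since the original expression is not shown here; they only illustrate how a greedy .* runs past the tag while [^>]* stops at it.

```php
<?php
// Hypothetical input line, for illustration only.
$line = '<img src="a.png" style="display:none"> keep <b>this</b>';

// Greedy .* extends to the LAST > on the line, so everything is deleted.
$greedy = preg_replace('/<img.*style.*none.*>/', '', $line);

// [^>]* cannot cross a >, so the match ends at the img tag's own closing bracket.
$bounded = preg_replace('/<img[^>]*style[^>]*none[^>]*>/', '', $line);

var_dump($greedy);
var_dump($bounded);
```

With the bounded pattern, only the img tag is removed and the text after it survives.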
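Here is another version that works with any tag, whether the inline style is written as display:none or display: none. It also deletes the content inside the tag, as long as that content contains no nested tags or line breaks (the .*? stops at the first > it finds and does not match newlines).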
$html = preg_replace('/<[^>]+style[^>]+display:\s*none[^>]+>.*?>/', '', $html);
I tested it with the following input and it works fine:
Only show<div style='display:none'>Delete inside content as well</div> this text.
Only show<span style='display: none'>Delete inside content as well</span> this text.
Only show<div style="display: none">Delete inside content as well</div> this text.
Only show<span style="display:none;">Delete inside content as well</span> this text.
Each line should now output only:
Only show this text.
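The test above can be reproduced with a short script. This is a sketch: the four samples are the lines from this answer, while the $nested string is a hypothetical extra case added here to show where the pattern is likely to fall short.

```php
<?php
$pattern = '/<[^>]+style[^>]+display:\s*none[^>]+>.*?>/';

// The four sample lines from the answer.
$samples = [
    "Only show<div style='display:none'>Delete inside content as well</div> this text.",
    "Only show<span style='display: none'>Delete inside content as well</span> this text.",
    'Only show<div style="display: none">Delete inside content as well</div> this text.',
    'Only show<span style="display:none;">Delete inside content as well</span> this text.',
];

$results = [];
foreach ($samples as $sample) {
    $results[] = preg_replace($pattern, '', $sample);
}

// Hypothetical extra case: a nested tag inside the hidden element.
// The lazy .*? stops at the first > it meets, which here belongs to <b>.
$nested = 'Only show<div style="display:none">Delete <b>inside</b> content</div> this text.';
$nestedResult = preg_replace($pattern, '', $nested);

print_r($results);
echo $nestedResult, "\n";
```

The nested case leaves part of the hidden content and the closing tags behind, which is the usual argument for a DOM-based approach when the markup is not under your control.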
Your regular expression is way too broad; .* means "match anything", so this would match:
<img src="foo.png" style="something">Some random displayed text : foo none; bar<br>
At the very least, you probably want to exclude closing brackets from your matches, so use [^>]* instead of .*. You also might want to read this, though, and look into using something that actually understands HTML, like DOMDocument.
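A quick way to check the false positive is to run preg_match() against the sample line above. Both patterns below are hypothetical stand-ins, since the original expression is not reproduced here; treat this as a sketch of the idea, not of the asker's exact regex.

```php
<?php
// The sample line from this answer.
$line = '<img src="foo.png" style="something">Some random displayed text : foo none; bar<br>';

// Hypothetical broad pattern: .* happily crosses tag boundaries,
// picking up "display" from "displayed" and "none" from the text.
$broad = preg_match('/<img.*style.*display.*none.*>/', $line);

// Hypothetical bounded pattern: [^>]* must stay inside the img tag.
$bounded = preg_match('/<img[^>]*style[^>]*display[^>]*none[^>]*>/', $line);

var_dump($broad);
var_dump($bounded);
```

preg_match() returns 1 for a match and 0 for none, so the two results show directly whether each pattern is fooled by the visible text.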