I have the following content within an HTML page that stretches over multiple lines
You can also use [\s\S] instead of . combined with the DOTALL flag s to match everything, because [\s\S] means exactly that: \s matches all whitespace characters (including newline) and \S matches everything that is not a whitespace character (i.e. everything else). In some regular-expression implementations, this works better than enabling DOTALL.
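A minimal sketch of the difference, assuming a made-up two-line $body string (not from the original question). It compares a plain . pattern, the same pattern with the s flag, and the [\s\S] variant:

```php
<?php
// Hypothetical sample input: a div whose content spans two lines.
$body = "<div>first line\nsecond line</div>";

// Without the s flag, . does not match the newline, so nothing is found.
$plain = preg_match('/<div>(.*)<\/div>/', $body, $m1);

// With the s (DOTALL) flag, . also matches the newline.
$dotall = preg_match('/<div>(.*)<\/div>/s', $body, $m2);

// [\s\S] matches any character, so the flag is not needed.
$class = preg_match('/<div>([\s\S]*)<\/div>/', $body, $m3);

var_dump($plain, $dotall, $class); // int(0), int(1), int(1)
```

Both matching variants capture the same text here, so which one to use is mostly a matter of what the regex engine at hand supports.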
Caution: .* with the DOTALL flag and [\s\S]* are both greedy and will consume as much of the string as they can. If you want them to stop at a certain position (e.g. the first </div>), put the non-greedy operator ? after your quantifier, e.g. .*? or [\s\S]*?
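To illustrate the greedy versus non-greedy behaviour, here is a small sketch on an invented string containing two div elements (the input is an assumption for demonstration only):

```php
<?php
// Hypothetical input with two consecutive divs; the first spans two lines.
$body = "<div>one\n</div><div>two</div>";

// Greedy: .* runs to the LAST </div> in the string.
preg_match('/<div>(.*)<\/div>/s', $body, $greedy);

// Non-greedy: .*? stops at the FIRST </div>.
preg_match('/<div>(.*?)<\/div>/s', $body, $lazy);

var_dump($greedy[1]); // "one\n</div><div>two"
var_dump($lazy[1]);   // "one\n"
```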
It is possible to use a regex to strip out chunks of HTML, but you need to wrap the HTML in custom tags, which browsers ignore. For example:
<?php
$html='
<div>This will be shown</div>
<custom650 rel="nofollow">
<p class="subformedit">
<a href="#" class="mylink">Link</a>
<div class="morestuff">
... more html in here ...
</div>
</p>
</custom650>
<div>This will also be shown</div>
';
To strip every element whose opening tag carries the rel="nofollow" attribute, together with its content, you can use the following regex:
$newhtml = preg_replace('/<([^\s]+)[^>]*rel="nofollow"[^>]*>.*?<\/\1>/si', '', $html);
From experience, start the custom tags on a new line. Undoubtedly a hack, but might help someone.
It is the "s" flag; it enables . to match newlines.
If this weren't HTML, I'd tell you to use the DOTALL modifier to change the meaning of . from 'match everything except new line' to 'match everything':
preg_replace('/(.*)<\/div>/s','abc',$body);
But this is HTML, so use an HTML parser instead.
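As a rough sketch of the parser route, the following uses PHP's bundled DOM extension (DOMDocument and DOMXPath). The input markup and the class name "target" are assumptions made up for this example, and the DOM extension must be enabled:

```php
<?php
// Hypothetical input; the class name "target" is invented for this example.
$body = "<div class=\"target\">line one\nline two</div><div>keep me</div>";

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($body);
libxml_clear_errors();

// Select the div(s) to change and replace their content with plain text.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@class="target"]') as $div) {
    $div->nodeValue = 'abc';
}

// saveHTML() returns a full document, including a doctype and html/body wrappers.
$result = $doc->saveHTML();
echo $result;
```

Because the parser works on the element tree, line breaks inside the div make no difference, and neither greediness nor the s flag comes into play.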