How to allow only specific set of HTML tags & specific set of Attributes using general Regex?
Allowed HTML Tags:
p|body|b
You can't parse HTML with regex (there's a reason why that is one of the top voted posts on Stackoverflow)
Finally I have achieved this in two steps:-
//Allowed list of HTML Tags
<(?!/?(p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption)(>|\s))[^<]+?>
//Allowed list of HTML Attributes
\s(?!(alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel))\w+(\s*=\s*["|']?[/.,#?\w\s:;-]+["|']?)
Using above two regex, I have filtered my whole html.
Now I have reduced it into one regex, which filter all required HTML tags & attributes
(<(?!/?(p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption)(>|\s))[^<]+?>)|(\s(?!(alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel)\b)[\w:]+(\s*=\s*["|']?[/.,#?\w\s:;-]+["|']?))
This seems very similar to a question I posted awhile back:
How do I filter all HTML tags except a certain whitelist?
Here is a Perl solution using PCRE compatible regex. It is not aware of comments, doctype, CDATA, etc. Those should be added for a more complete solution.
# allowed tag and attribute names
my $allowed_tags_open = 'p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|a|tr|td|table|tbody|label|div|sup|sub|caption';
my $allowed_tags_self_closing = 'img|br|hr';
my $allowed_attributes = 'alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel';
$allowed_attributes .= '|style'; # for testing
# definitions for matching allowed tag and attribute names
my $re_tags = qr~(?(DEFINE)
(?<tags_open>
/?+
(?>
(?: $allowed_tags_open )
(?! [^\s>/] ) # from (?&tagname)
)
)
(?<tags_self_closing>
(?>
(?: $allowed_tags_self_closing )
(?! [^\s>/] ) # from (?&tagname)
)
)
(?<tags> (?> (?&tags_open) | (?&tags_self_closing) ) )
(?<attribs>
(?>
(?: $allowed_attributes )
(?! [^\s=/>] ) # from (?&attname)
)
)
)~xi;
# definitions for matching the tags
# trying to follow compatible tokenization characteristics of modern browsers
my $re_defs = qr~(?(DEFINE)
(?<tagname> [a-z/][^\s>/]*+ ) # will match the leading / in closing tags
(?<attname> [^\s>/][^\s=/>]*+ ) # first char can be pretty much anything, including =
(?<attval> (?>
"[^"]*+" |
\'[^\']*+\' |
[^\s>]*+ # unquoted values can contain quotes, = and /
)
)
(?<attrib> (?&attname)
(?: \s*+
= \s*+
(?&attval)
)?+
)
(?<crap> (?!/>)[^\s>] ) # most crap inside tag is ignored, but don't eat the last / in self closing tags
(?<tag> <(?&tagname)
(?: \s*+ # spaces between attributes not required: <b/foo=">"style=color:red>bold red text</b>
(?>
(?&attrib) | # order matters
(?&crap) # if not an attribute, eat the crap
)
)*+
\s*+ /?+
>
)
)~xi;
sub sanitize_html{
my $str = shift;
$str =~ s/(?&tag) $re_defs/ sanitize_tag($&) /gexo;
return $str;
}
sub sanitize_tag{
my $tag = shift;
my ($name, $attr, $end) =
$tag =~ /^ < ((?&tags)) (.*?) ( \/?+ > ) $ $re_tags/xo
or return ''; # return empty string if not allowed tag
# return a new clean closing tag if it's a closing tag
return "<$name>" if substr($name, 0, 1) eq '/';
# clean attributes
return "<$name" . sanitize_attributes($attr) . $end;
}
sub sanitize_attributes{
my $attr = shift;
my $new = '';
$attr =~ s{
\G
\s*+ # spaces between attributes not required
(?>
( (?&attrib) ) | # order matters
(?&crap) # if not an attribute, eat the crap
)
$re_defs
}{
my $att = $1;
$new .= " $att" if $att && $att =~ /^(?&attribs) $re_tags/xo;
'';
}gexo;
return $new;
}
my $test = <<'_TEST_';
<b>simple</b>
self <img>closing</img>
<abc id="test">new tag and known attribute</abc>
<a id="test" xyz="testattr" href="/foo">one unknown attr</a>
<a id="foo">attr in closing tag</a id="foo">
<b/#ñ%&/()!¢º`=">="">crap be gone</b> not bold<br/x"/>
<b/style=color:red;background:url("x.gif");/*="still.CSS*/ id="x"zz"<script class="x">tricky</b/ x=">"//> not bold
_TEST_
print $test, "\n";
print '-' x 70, "\n";
print sanitize_html $test;
Output:
<b>simple</b>
self <img>closing</img>
<abc id="test">new tag and known attribute</abc>
<a id="test" xyz="testattr" href="/foo">one unknown attr</a>
<a id="foo">attr in closing tag</a id="foo">
<b/#ñ%&/()!¢º`=">="">crap be gone</b> not bold<br/x"/>
<b/style=color:red;background:url("x.gif");/*="still.CSS*/ id="x"zz"<script class="x">tricky</b/ x=">"//> not bold
----------------------------------------------------------------------
<b>simple</b>
self <img>closing
new tag and known attribute
<a id="test" href="/foo">one unknown attr</a>
<a id="foo">attr in closing tag</a>
<b>crap be gone</b> not bold<br/>
<b style=color:red;background:url("x.gif");/*="still.CSS*/ id="x" class="x">tricky</b> not bold
See how your browser parses the tricky tags: jsFiddle
Possibly relevant: