Regex to allow only a set of HTML tags and attributes

一个人的身影 2020-12-17 05:00

How can I allow only a specific set of HTML tags and a specific set of attributes using a general regex?

Allowed HTML Tags:

p|body|b

4 answers
  • 2020-12-17 05:30

    You can't parse HTML with regex (there's a reason that is one of the top-voted posts on Stack Overflow).
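
    For comparison, here is a minimal sketch of a parser-based allow-list using Python's standard `html.parser`. Python is an assumption here (the question names no language), and `AllowListFilter`, `sanitize`, and the attribute allow-list are hypothetical names and values chosen for illustration. Only the tag list `p|body|b` comes from the question.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "body", "b"}         # tags named in the question
ALLOWED_ATTRS = {"class", "id", "title"}  # hypothetical attribute allow-list


class AllowListFilter(HTMLParser):
    """Re-emit only allowed tags and attributes; text is passed through escaped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Disallowed tags are dropped; their text content is still kept.
        if tag not in ALLOWED_TAGS:
            return
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in ALLOWED_ATTRS
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitize(html: str) -> str:
    parser = AllowListFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="x" class="a">hi <i>there</i></p>'))
```

    Because the parser tokenizes the markup first, the allow-list check is a plain set lookup on tag and attribute names rather than a pattern over raw text. This sketch ignores comments and doctype declarations, so treat it as a starting point rather than a complete sanitizer.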

  • 2020-12-17 05:34

    I finally achieved this in two steps:

    //Allowed list of HTML Tags
    
    <(?!/?(p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption)(>|\s))[^<]+?>
    
    //Allowed list of HTML Attributes
    
    \s(?!(alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel))\w+(\s*=\s*["|']?[/.,#?\w\s:;-]+["|']?)
    

    Using the two regexes above, I filtered my whole HTML.

    EDIT:

    I have now reduced it to a single regex that filters all the required HTML tags and attributes:

    (<(?!/?(p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption)(>|\s))[^<]+?>)|(\s(?!(alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel)\b)[\w:]+(\s*=\s*["|']?[/.,#?\w\s:;-]+["|']?))
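
    As an illustration of how the combined regex could be applied, here is a small sketch in Python (my choice of language, not the answer's). The pattern has the same structure as the one above, built from the two allow-lists; `strip_disallowed` is a hypothetical helper name, and replacing every match with an empty string is an assumption about how the filtering was done.

```python
import re

# Allow-lists copied from the regex above (the duplicate "title" is dropped).
ALLOWED_TAGS = (
    "p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|"
    "table|tbody|label|div|sup|sub|caption"
)
ALLOWED_ATTRS = (
    "alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|"
    "summary|class|id|name|target|nowrap|scope|axis|cellpadding|"
    "cellspacing|dir|lang|rel"
)

# First alternative: a whole tag whose name is not in the allow-list.
# Second alternative: a single attribute whose name is not in the allow-list.
DISALLOWED = re.compile(
    rf"""(<(?!/?({ALLOWED_TAGS})(>|\s))[^<]+?>)"""
    rf"""|(\s(?!({ALLOWED_ATTRS})\b)[\w:]+(\s*=\s*["|']?[/.,#?\w\s:;-]+["|']?))"""
)


def strip_disallowed(html: str) -> str:
    """Remove every disallowed tag and attribute matched by the regex."""
    return DISALLOWED.sub("", html)


sample = '<p onclick="x" class="a">hi</p><script>bad</script>'
print(strip_disallowed(sample))  # -> <p class="a">hi</p>bad
```

    Note that only the `<script>` tags are removed, so the text between them stays in the output. Also, because the value character class includes `\s`, an unquoted value such as `onclick=x` can swallow the name of the attribute that follows it, so quoted attribute values are the safer input for this pattern.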
    
  • 2020-12-17 05:36

    This seems very similar to a question I posted a while back:

    How do I filter all HTML tags except a certain whitelist?

  • 2020-12-17 05:39

    Here is a Perl solution using a PCRE-compatible regex. It is not aware of comments, doctype declarations, CDATA sections, etc.; those should be added for a more complete solution.

    # allowed tag and attribute names
    
    my $allowed_tags_open = 'p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|a|tr|td|table|tbody|label|div|sup|sub|caption';
    
    my $allowed_tags_self_closing = 'img|br|hr';
    
    my $allowed_attributes = 'alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel';
    
    $allowed_attributes .= '|style'; # for testing
    
    
    # definitions for matching allowed tag and attribute names
    
    my $re_tags = qr~(?(DEFINE)
        (?<tags_open>
            /?+
            (?>
                (?: $allowed_tags_open )
                (?! [^\s>/] )       # from (?&tagname)
            )
        )
        (?<tags_self_closing>
            (?>
                (?: $allowed_tags_self_closing )
                (?! [^\s>/] )       # from (?&tagname)
            )
        )
        (?<tags>    (?> (?&tags_open) | (?&tags_self_closing) )    )
        (?<attribs>
            (?>
                (?: $allowed_attributes )
                (?! [^\s=/>] )      # from (?&attname)
            )
        )
    )~xi;
    
    
    # definitions for matching the tags
    # trying to follow compatible tokenization characteristics of modern browsers
    
    my $re_defs = qr~(?(DEFINE)
        (?<tagname> [a-z/][^\s>/]*+    )    # will match the leading / in closing tags
        (?<attname> [^\s>/][^\s=/>]*+    )  # first char can be pretty much anything, including =
        (?<attval>  (?>
                        "[^"]*+" |
                        \'[^\']*+\' |
                        [^\s>]*+            # unquoted values can contain quotes, = and /
                    )
        ) 
        (?<attrib>  (?&attname)
                    (?: \s*+
                        = \s*+
                        (?&attval)
                    )?+
        )
        (?<crap>    (?!/>)[^\s>]    )       # most crap inside tag is ignored, but don't eat the last / in self closing tags
        (?<tag>     <(?&tagname)
                    (?: \s*+                # spaces between attributes not required: <b/foo=">"style=color:red>bold red text</b>
                        (?>
                            (?&attrib) |    # order matters
                            (?&crap)        # if not an attribute, eat the crap
                        )
                    )*+
                    \s*+ /?+
                    >
        )
    )~xi;
    
    
    
    sub sanitize_html{
        my $str = shift;
        $str =~ s/(?&tag) $re_defs/ sanitize_tag($&) /gexo;
        return $str;
    }
    
    
    sub sanitize_tag{
        my $tag = shift;
    
        my ($name, $attr, $end) =
            $tag =~ /^ < ((?&tags)) (.*?) ( \/?+ > ) $   $re_tags/xo
            or return '';  # return empty string if not allowed tag
    
        # return a new clean closing tag if it's a closing tag
        return "<$name>" if substr($name, 0, 1) eq '/';
    
        # clean attributes
        return "<$name" . sanitize_attributes($attr) . $end;
    }
    
    
    sub sanitize_attributes{
        my $attr = shift;
        my $new = '';
    
        $attr =~ s{
            \G
            \s*+                 # spaces between attributes not required
            (?>
                ( (?&attrib) ) | # order matters
                (?&crap)         # if not an attribute, eat the crap
            )
    
            $re_defs
        }{
            my $att = $1;
            $new .= " $att" if $att && $att =~ /^(?&attribs) $re_tags/xo;
            '';
        }gexo;
    
        return $new;
    }
    

    Test (ideone):

    my $test = <<'_TEST_';
    <b>simple</b>
    self <img>closing</img>
    
    <abc id="test">new tag and known attribute</abc>
    <a id="test" xyz="testattr" href="/foo">one unknown attr</a>
    <a id="foo">attr in closing tag</a id="foo">
    
    <b/#ñ%&/()!¢º`=">="">crap be gone</b> not bold<br/x"/>
    <b/style=color:red;background:url("x.gif");/*="still.CSS*/ id="x"zz"<script class="x">tricky</b/ x=">"//> not bold
    _TEST_
    
    print $test, "\n";
    print '-' x 70, "\n";
    print sanitize_html $test;
    

    Output:

    <b>simple</b>
    self <img>closing</img>
    
    <abc id="test">new tag and known attribute</abc>
    <a id="test" xyz="testattr" href="/foo">one unknown attr</a>
    <a id="foo">attr in closing tag</a id="foo">
    
    <b/#ñ%&/()!¢º`=">="">crap be gone</b> not bold<br/x"/>
    <b/style=color:red;background:url("x.gif");/*="still.CSS*/ id="x"zz"<script class="x">tricky</b/ x=">"//> not bold
    
    ----------------------------------------------------------------------
    <b>simple</b>
    self <img>closing
    
    new tag and known attribute
    <a id="test" href="/foo">one unknown attr</a>
    <a id="foo">attr in closing tag</a>
    
    <b>crap be gone</b> not bold<br/>
    <b style=color:red;background:url("x.gif");/*="still.CSS*/ id="x" class="x">tricky</b> not bold
    

    See how your browser parses the tricky tags: jsFiddle

    Possibly relevant:

    • HTML Tokenization
    • XSS Cheat Sheet
    • HTML5 Security
    • XML tag parsing