Regex: Strip HTML attributes except SRC

予麋鹿 2020-12-16 17:13

I'm trying to write a regular expression that will strip all tag attributes except for the SRC attribute. For example:

6 answers
  • 2020-12-16 17:28

    Posting to provide a solution for Oracle Regex

    <([^!][a-z][a-z0-9]*)([^>]*(\ssrc=[''''\"][^''''\"]*[''''\"]))?[^>]*?(\/?)>
    
  • 2020-12-16 17:30

    You usually should not parse HTML using regular expressions.

    Instead, you should call DOMDocument::loadHTML.
    You can then recurse through the elements in the document and call removeAttribute.
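
    The DOM approach described above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: keepOnlySrc() is a hypothetical helper name, the <div> wrapper is just one way to get a fragment back out of the document that loadHTML() builds, and the input is assumed to be plain ASCII (loadHTML() needs extra care with other encodings).

```php
<?php
// Sketch: drop every attribute except src by walking the DOM.
// keepOnlySrc() is a hypothetical helper, not part of any library.
function keepOnlySrc(string $html): string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Wrap the fragment so it can be located again after parsing;
    // loadHTML() adds <html>/<body> around fragments on its own.
    $doc->loadHTML('<div>' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The first <div> in document order is the wrapper added above.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    foreach ($wrapper->getElementsByTagName('*') as $element) {
        // Copy the names first: the attribute map is live and
        // changes while attributes are being removed.
        $names = [];
        foreach ($element->attributes as $attr) {
            $names[] = $attr->nodeName;
        }
        foreach ($names as $name) {
            if (strtolower($name) !== 'src') {
                $element->removeAttribute($name);
            }
        }
    }

    // Serialize only the wrapper's children, not the wrapper itself.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$text = '<p id="paragraph" class="green">This is a paragraph with an image '
      . '<img src="/path/to/image.jpg" width="50" height="75"/></p>';
echo keepOnlySrc($text), "\n";
```

    Because the attribute values are parsed rather than pattern-matched, a '>' inside a quoted value should not confuse this version. Note that saveHTML() writes HTML rather than XHTML, so self-closing slashes are not preserved.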

  • 2020-12-16 17:38

    This might work for your needs:

    $text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';
    
    echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",'<$1$2$3>', $text);
    
    // <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>
    

    The RegExp broken down:

    /              # Start Pattern
     <             # Match '<' at beginning of tags
     (             # Start Capture Group $1 - Tag Name
      [a-z]         # Match 'a' through 'z'
      [a-z0-9]*     # Match 'a' through 'z' or '0' through '9' zero or more times
     )             # End Capture Group
     (?:           # Start Non-Capture Group
      [^>]*         # Match anything other than '>', Zero or More Times
      (             # Start Capture Group $2 - ' src="...."'
       \s            # Match one whitespace
       src=          # Match 'src='
       ['"]          # Match ' or "
       [^'"]*        # Match anything other than ' or " 
       ['"]          # Match ' or "
      )             # End Capture Group 2
     )?            # End Non-Capture Group, match group zero or one time
     [^>]*?        # Match anything other than '>', Zero or More times, not-greedy (won't eat the /)
     (\/?)         # Capture Group $3 - '/' if it is there
     >             # Match '>'
    /i            # End Pattern - Case Insensitive
    

    With appropriate quoting and the replacement text <$1$2$3>, it should strip every attribute other than src= from well-formed HTML tags.

    Please note: this isn't necessarily going to work on ALL input, as the anti-HTML-plus-RegExp people are so cleverly noting below. There are a few pitfalls, most notably that <p style=">"> would end up as <p>">, along with a few other breakages. I would recommend looking at Zend_Filter_StripTags as a more robust tag/attribute filter in PHP.
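
    To make the caveat concrete, here is a small sketch that runs the same pattern against two awkward inputs. The first is the <p style=">"> case mentioned above; the second (an unquoted src value) is an extra case added here for illustration, not one taken from the original answer.

```php
<?php
// The pattern from the answer above, unchanged.
$pattern = "/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i";

// Case 1: a '>' inside a quoted attribute value ends the match early,
// leaving the tail of the attribute behind as text.
$broken = preg_replace($pattern, '<$1$2$3>', '<p style=">">text</p>');
echo $broken, "\n"; // <p>">text</p>

// Case 2 (illustrative addition): the pattern only recognises quoted
// src values, so an unquoted src is stripped with everything else.
$unquoted = preg_replace($pattern, '<$1$2$3>', '<img src=a.jpg width=5>');
echo $unquoted, "\n"; // <img>
```

    Both results follow from the pattern treating the first '>' as the end of the tag and requiring quotes around the src value.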

  • 2020-12-16 17:43

    Alright, here's what I used that seems to be working well:

    <([A-Z][A-Z0-9]*)(\b[^>src]*)(src\=[\'|"|\s]?[^\'][^"][^\s]*[\'|"|\s]?)?(\b[^>]*)>
    

    Feel free to poke any holes in it.

  • 2020-12-16 17:44

    As mentioned above, you shouldn't use regex to parse HTML or XML.

    I would handle your example with str_replace(), provided the input is always the same.

    $str = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';
    
    $str = str_replace('id="paragraph" class="green"', "", $str);
    
    $str = str_replace('width="50" height="75"',"",$str);
    
  • 2020-12-16 17:52

    Unfortunately I'm not sure how to answer this question for PHP. If I were using Perl I would do the following:

    use strict;
    my $data = q^<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>^;
    
    $data =~ s{
        <([^/> ]+)([^>]+)> # split into tagtype, attribs
    }{
        my $attribs = $2;
        my @parts = split( /\s+/, $attribs ); # separate by whitespace
        @parts = grep { m/^src=/i } @parts;   # retain just src attributes
        if ( @parts ) {
            "<" . join( " ", $1, @parts ) . ">";
        } else {
            "<" . $1 . ">";
        }
    }xseg;
    
    print( $data );
    

    which returns

    <p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>
    