Regex: Strip HTML attributes except SRC

予麋鹿 2020-12-16 17:13

I'm trying to write a regular expression that will strip all tag attributes except for the SRC attribute. For example:

6 answers
  •  予麋鹿 (original poster)
    2020-12-16 17:38

    This might work for your needs:

    $text = '

    This is a paragraph with an image

    ';
    echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i", '<$1$2$3>', $text);
    // Output:

    This is a paragraph with an image

    The RegExp broken down:

    /              # Start Pattern
     <             # Match '<' at beginning of tags
     (             # Start Capture Group $1 - Tag Name
      [a-z]         # Match 'a' through 'z'
      [a-z0-9]*     # Match 'a' through 'z' or '0' through '9' zero or more times
     )             # End Capture Group
     (?:           # Start Non-Capture Group
      [^>]*         # Match anything other than '>', Zero or More Times
      (             # Start Capture Group $2 - ' src="...."'
       \s            # Match one whitespace
       src=          # Match 'src='
       ['"]          # Match ' or "
       [^'"]*        # Match anything other than ' or " 
       ['"]          # Match ' or "
      )             # End Capture Group 2
     )?            # End Non-Capture Group, match group zero or one time
     [^>]*?        # Match anything other than '>', zero or more times, non-greedy (won't eat the /)
     (\/?)         # Capture Group $3 - '/' if it is there
     >             # Match '>'
    /i            # End Pattern - Case Insensitive
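
The breakdown above can also be kept as a live, commented pattern by using PCRE's x (extended) modifier, which ignores unescaped whitespace and treats # as the start of a comment. The following is a minimal sketch rather than part of the original answer: the ~ delimiter and the sample <img> tag are assumptions chosen for illustration (with / as the delimiter, a slash inside a comment would end the pattern early).

```php
<?php
// Same pattern as in the breakdown, written with the x modifier so the
// comments live inside the regex itself.
// Assumption: '~' is used as the delimiter so that '/' needs no escaping.
$pattern = <<<'REGEX'
~
  <                        # opening angle bracket of a tag
  ([a-z][a-z0-9]*)         # $1: tag name
  (?:
    [^>]*                  # any attributes that come before src
    (\ssrc=['"][^'"]*['"]) # $2: whitespace plus the quoted src attribute
  )?                       # the src part is optional
  [^>]*?                   # remaining attributes, matched lazily
  (/?)                     # $3: optional self-closing slash
  >                        # closing angle bracket
~ix
REGEX;

// Hypothetical sample tag, for illustration only.
$tag = '<img alt="logo" src="logo.png" width="50"/>';
$result = preg_replace($pattern, '<$1$2$3>', $tag);

echo $result, "\n"; // <img src="logo.png"/>
```

Keeping the comments inside the pattern means the explanation cannot drift away from the expression it describes.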
    

    With the quotes escaped for a PHP string and <$1$2$3> as the replacement text, this should strip every attribute other than src= from well-formed HTML tags.
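
To see the replacement at work, here is a small sketch. The input markup (tag names, attribute values, and the image path) is hypothetical and made up for this example; only the pattern and the replacement text come from the answer.

```php
<?php
// Hypothetical input: the tags and the image path are invented for illustration.
$text = '<p id="intro" class="green">This is a paragraph with an image '
      . '<img src="/path/to/image.jpg" width="50" height="75"/></p>';

// Pattern and replacement as given in the answer.
$pattern = "/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i";
$stripped = preg_replace($pattern, '<$1$2$3>', $text);

echo $stripped, "\n";
// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>
```

Closing tags such as </p> pass through untouched, because the pattern requires a letter immediately after the '<'.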

    Please note: this isn't necessarily going to work on ALL input, as the anti-HTML-plus-RegExp people are so cleverly noting below. There are a few pitfalls, most notably that a tag whose attribute value contains a '>' is cut off at that character, so the rest of the value (ending in "> ) is left behind in the output, along with a few other breakages. I would recommend looking at Zend_Filter_StripTags as a foolproof tags/attributes filter in PHP.
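
The sketch below reproduces that kind of failure and contrasts it with a parser-based approach. It does not use Zend_Filter_StripTags; as a swapped-in alternative it relies on PHP's bundled DOM extension. The input string and the helper name keepOnlySrc are hypothetical, and loadHTML assumes ISO-8859-1 input unless told otherwise, so treat this as a starting point rather than a finished filter.

```php
<?php
// Hypothetical input whose title attribute contains a '>' character.
$html = '<p title="a > b">text <img src="x.png" alt="pic"></p>';

// The regex from the answer stops at the first '>' and leaves the tail
// of the attribute value behind in the output.
$pattern = "/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i";
$viaRegex = preg_replace($pattern, '<$1$2$3>', $html);
echo $viaRegex, "\n"; // <p> b">text <img src="x.png"></p>

// Hypothetical helper (not from the answer): parse the fragment with the
// DOM extension and remove every attribute whose name is not "src".
function keepOnlySrc(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    // The <div> wrapper gives the fragment a single root element.
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the matches into an array before modifying the tree.
    $attributes = iterator_to_array($xpath->query('//@*[name() != "src"]'));
    foreach ($attributes as $attribute) {
        $attribute->ownerElement->removeAttributeNode($attribute);
    }

    // Serialize the wrapper's children, leaving the wrapper itself out.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$viaDom = keepOnlySrc($html);
echo $viaDom, "\n";
```

Because the parser understands quoting, the '>' inside the attribute value no longer ends the tag, and only the src attribute survives.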
