regex (in PHP) to match & that aren't HTML entities

说谎 2020-12-17 05:01

Here's the goal: to replace all standalone ampersands with &amp;amp; but NOT replace those that are already part of an HTML entity such as &amp;nbsp;.

I think I nee

5 answers
  •  孤城傲影
    2020-12-17 05:34

    The others are good suggestions, and might be a better way to do it. But I thought I'd try to answer the question as asked, if only to provide a regex example.

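    The following uses the expanded (free-spacing) form that some engines allow, in which whitespace is ignored and comments are permitted (the x modifier in PHP's PCRE functions). The odd thing, of course, is that an engine which allows commented regexes usually allows other simplified expressions too, though those are not as generic across engines. I'll put those simplified expressions in parens in the comments.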

    &                      # an ampersand
    ( \#                   # a '#' character
      [1-9]                # followed by a non-zero digit,
      [0-9]{1,3}           # then 1 to 3 more (2 to 4 total)  (\d{1,3} or \p{IsDigit}{1,3})
    | [A-Za-z]             # OR a letter                      (\p{IsAlpha})
      [0-9A-Za-z]+         # followed by letters or numbers   (\p{IsAlnum}+)
    )
    ;                      # all capped with a ';'
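
To apply this to the question as asked, one option is to wrap the entity pattern in a negative lookahead, so that only ampersands that do not start an entity get replaced. The following is a minimal PHP sketch under that assumption, not a definitive implementation: the lookahead wrapper is an addition to the pattern above, and the function name `escapeBareAmpersands` is made up for illustration.

```php
<?php
// Sketch: escape only the ampersands that do not begin an entity.
// escapeBareAmpersands is a hypothetical helper name, not a PHP built-in.
function escapeBareAmpersands(string $text): string
{
    // The x modifier enables free-spacing mode, so the pattern can carry
    // comments. A literal "#" must be written as \# in this mode.
    $pattern = '/
        &                      # an ampersand
        (?!                    # NOT followed by an entity body:
          (?: \#               #   a "#" character
              [1-9]            #   followed by a non-zero digit
              [0-9]{1,3}       #   then 1 to 3 more digits
            | [A-Za-z]         #   OR a letter
              [0-9A-Za-z]+     #   followed by letters or digits
          )
          ;                    #   closed by a semicolon
        )
    /x';

    return preg_replace($pattern, '&amp;', $text);
}

echo escapeBareAmpersands('Tom & Jerry&nbsp;&#169; R&D &amp; more'), "\n";
// prints: Tom &amp; Jerry&nbsp;&#169; R&amp;D &amp; more
```

Note that the pattern as written treats single-digit references such as `&#9;` and hexadecimal references such as `&#x27;` as bare ampersands, so widen the numeric branch if your input contains those.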
    

    You could even throw a bunch of expected entities in there as well, to help out the regex scanner.

    &                      # an ampersand
    ( amp | apos | gt | lt | nbsp | quot                 
                           # standard entities
    | bull | hellip | [lr][ds]quo | [mn]dash | permil          
                           # some fancier ones
    | \#                   # a '#' character
      [1-9]                # followed by a non-zero digit,
      [0-9]{1,3}           # then 1 to 3 more (2 to 4 total)
    |  [A-Za-z]            # OR a letter
      [0-9A-Za-z]+         # followed by letters or numbers
    )
    ;                      # all capped with a ';'
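
To see which substrings this longer pattern treats as entities, it can be run through `preg_match_all`. This is only a sketch: the variable names and the sample string are made up for illustration, and the capturing group is changed to a non-capturing one so that only the whole match is collected.

```php
<?php
// Sketch: list the substrings the extended pattern recognizes as entities.
// $entityPattern and $sample are illustrative names, not from the answer.
$entityPattern = '/
    &                      # an ampersand
    (?: amp | apos | gt | lt | nbsp | quot
                           # standard entities
      | bull | hellip | [lr][ds]quo | [mn]dash | permil
                           # some fancier ones
      | \#                 # a "#" character
        [1-9]              # followed by a non-zero digit
        [0-9]{1,3}         # then 1 to 3 more digits
      | [A-Za-z]           # OR a letter
        [0-9A-Za-z]+       # followed by letters or digits
    )
    ;                      # all capped with a semicolon
/x';

$sample = 'AT&T &ldquo;rock &amp; roll&rdquo; &hellip; &#8212; &copy; &bogus';

// $matches[0] receives every full match, in order of appearance.
preg_match_all($entityPattern, $sample, $matches);
print_r($matches[0]);
```

With this sample, the matches are `&ldquo;`, `&amp;`, `&rdquo;`, `&hellip;`, `&#8212;` and `&copy;`. The `&` in `AT&T` and the unterminated `&bogus` are not matched, which is the behaviour the question needs.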
    
