PHP: split string on comma, but NOT when between braces or quotes?

梦毁少年i 2020-12-03 22:45

In PHP I have the following string:

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
        
2 Answers
  •  囚心锁ツ
    2020-12-03 23:37

    Crazy solution

    A spartan regex that tokenizes and also validates all the tokens that it extracts:

    \G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()',\s]++\s*+(?(?!\)),)|\s*+'[^'\r\n]*+'\s*+(?(?!\)),))++\))|[^()',\s]++|'[^'\r\n]*+')\s*+(?:,|$)
    

    Regex101

    Put it in a string literal, with delimiters:

    '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/'
    

    ideone

    The result is in capturing group 1. In the example on ideone, I specify the PREG_OFFSET_CAPTURE flag, so that you can check, from the offset and length of the last match in group 0 (the entire match), whether the entire source string has been consumed.
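
The check described above could look roughly like the following. This is a minimal sketch rather than the ideone example itself; the variable names (`$pattern`, `$tokens`, `$consumed`) and the choice of `PREG_SET_ORDER` are illustrative assumptions. It runs the pattern over the string from the question, collects group 1, and compares the end offset of the last full match with the length of the input.

```php
<?php
// Pattern from the answer, as a single-quoted PHP string with "/" delimiters.
$pattern = '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/';

// Input string from the question.
$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";

// With PREG_OFFSET_CAPTURE every captured entry is a [text, byte offset] pair.
// PREG_SET_ORDER (an illustrative choice) groups the captures per match.
$count = preg_match_all($pattern, $str, $m, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

$tokens = [];
$end = 0;
foreach ($m as $match) {
    $tokens[] = $match[1][0];                    // group 1: the token itself
    $end = $match[0][1] + strlen($match[0][0]);  // end offset of the entire match
}

// Matches are chained by \G, so the input is fully valid only if the
// last match ends exactly at the end of the string.
$consumed = ($end === strlen($str));

print_r($tokens);
var_dump($consumed);
```

If `$consumed` is false, the tokens collected so far are only the valid prefix of the input, and `$end` is the offset at which tokenizing stopped.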

    Assumptions

    • Non-quoted text may not contain any whitespace character, as defined by \s. Consequently, it may not span multiple lines.
    • Non-quoted text may not contain (, ), ' or ,.
    • Non-quoted text must contain at least 1 character.
    • Single quoted text may not span multiple lines.
    • Single-quoted text may not contain a quote character. Consequently, there is no way to specify a literal ' inside it.
    • Single quoted text may be empty.
    • Bracket token contains one or more of the following as sub-tokens: non-quoted text token, single quoted text token, or another bracket token.
    • In bracket token, 2 adjacent sub-tokens are separated by exactly one ,
    • Bracket token starts with ( and ends with ).
    • Consequently, a bracket token must have balanced brackets, and empty bracket () is not allowed.
    • The input contains one or more tokens, each of which is non-quoted text, single-quoted text, or a bracket token. Tokens are separated by a comma ,. A single trailing comma , is considered valid.
    • Whitespace characters (as defined by \s, which includes newline characters) are allowed freely between tokens, the commas separating them, and the brackets ( and ) of bracket tokens.
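
To make a few of these assumptions concrete, here is a small sketch that wraps the pattern in a helper and feeds it inputs that respect or violate the rules. The function name `tokenize`, its null-on-failure convention, and the test inputs are all hypothetical choices made for illustration; the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the top-level tokens, or null when the input
// violates the assumptions (i.e. the matches do not cover the whole string).
function tokenize(string $input): ?array
{
    $pattern = '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/';

    $count = preg_match_all($pattern, $input, $m, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    if (!$count) {
        return null; // not a single token matched (or PCRE reported an error)
    }

    // End offset of the last full match; it must coincide with the end of the input.
    $last = $m[$count - 1];
    if ($last[0][1] + strlen($last[0][0]) !== strlen($input)) {
        return null;
    }

    // Keep only the text of group 1 from every match.
    return array_map(function ($match) {
        return $match[1][0];
    }, $m);
}

$cases = [
    "A, 'b c', (D, (E, 'f'))", // accepted: nesting, quotes, free whitespace
    "A, B,",                   // accepted: single trailing comma
    "''",                      // accepted: empty quoted text
    "A B, C",                  // rejected: whitespace inside non-quoted text
    "()",                      // rejected: empty bracket token
    "A, (B, C",                // rejected: unbalanced bracket
];

foreach ($cases as $case) {
    $result = tokenize($case);
    printf("%-26s => %s\n", $case, $result === null ? 'rejected' : implode(' | ', $result));
}
```

Returning null rather than a partial list is only one possible design; a caller that wants to report where tokenizing stopped could return the end offset of the last match instead.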

    Breakdown

    \G\s*+
    (
      (
        \(
        (?:
            \s*+
            (?2)
            \s*+
            (?(?!\)),)
          |
            \s*+
            [^()',\s]++
            \s*+
            (?(?!\)),)
          |
            \s*+
            '[^'\r\n]*+'
            \s*+
            (?(?!\)),)
        )++
        \)
      )
      |
      [^()',\s]++
      |
      '[^'\r\n]*+'
    )
    \s*+(?:,|$)
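
The same layout can be kept directly in PHP source. The sketch below assumes PCRE's `x` (extended) modifier, which ignores unescaped whitespace outside character classes and treats `#` as the start of a comment. The nowdoc string avoids having to escape `'`, and the `~` delimiter is an arbitrary choice. The comments are one reading of the pattern, not the answer author's wording.

```php
<?php
$pattern = <<<'REGEX'
~
\G \s*+                         # resume where the previous match ended, skip whitespace
(                               # group 1: one top-level token
  (                             # group 2: bracket token
    \(
    (?:
        \s*+ (?2) \s*+          # nested bracket token (recursion into group 2)
        (?(?!\)),)              # unless ")" follows, a comma is required
      |
        \s*+ [^()',\s]++ \s*+   # non-quoted text
        (?(?!\)),)
      |
        \s*+ '[^'\r\n]*+' \s*+  # single-quoted text
        (?(?!\)),)
    )++                         # one or more sub-tokens
    \)
  )
  |
  [^()',\s]++                   # non-quoted text at top level
  |
  '[^'\r\n]*+'                  # single-quoted text at top level
)
\s*+ (?:,|$)                    # a token ends at a comma or at the end of the input
~x
REGEX;

// Compact form from the answer, for comparison.
$compact = '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/';

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";

preg_match_all($pattern, $str, $a);
preg_match_all($compact, $str, $b);

// Both forms should yield the same group 1 captures.
var_dump($a[1] === $b[1]);
```

Because PHP looks for the closing delimiter by scanning forward, the comments in the extended pattern must not contain the delimiter character itself.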
    
