PHP: split string on comma, but NOT when between braces or quotes?

徘徊边缘 submitted on 2019-11-27 15:36:47

Instead of preg_split, use preg_match_all:

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO"; 

preg_match_all("/\((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+/", $str, $matches);

print_r($matches);

will print:

Array
(
    [0] => Array
        (
            [0] => AAA
            [1] => BBB
            [2] => (CCC,DDD)
            [3] => 'EEE'
            [4] => 'FFF,GGG'
            [5] => ('HHH','III')
            [6] => (('JJJ','KKK'), LLL, (MMM,NNN))
            [7] => OOO
        )

)

The regex \((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+ can be divided in three parts:

  1. \((?:[^()]|(?R))+\), which matches balanced pairs of parentheses, including nested ones
  2. '[^']*', which matches a single-quoted string
  3. [^(),\s]+, which matches any run of characters other than '(', ')', ',' and whitespace
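
To make the three parts easier to see, here is a minimal sketch of the same pattern written with the x (extended) modifier, which lets the pattern carry whitespace and comments. The shortened input string and the variable names are only for illustration, and the nowdoc syntax assumes PHP 5.3 or later.

```php
<?php
// Same three alternatives as above, spread out with the x modifier so that
// whitespace and #-comments inside the pattern are ignored by PCRE.
$pattern = <<<'REGEX'
/
    \( (?: [^()] | (?R) )+ \)   # 1. balanced parentheses; (?R) recurses into the whole pattern
  | ' [^']* '                   # 2. a single-quoted string
  | [^(),\s]+                   # 3. a bare token without parentheses, commas or whitespace
/x
REGEX;

// Shortened illustrative input (not the full string from the answer).
$str = "AAA, (CCC,DDD), 'FFF,GGG', (('JJJ','KKK'), LLL)";
preg_match_all($pattern, $str, $matches);
print_r($matches[0]);

// This pattern only extracts tokens; it does not reject malformed input.
preg_match_all($pattern, "AAA, (BBB", $unbalanced);
print_r($unbalanced[0]);
```

The first call should yield the four tokens AAA, (CCC,DDD), 'FFF,GGG' and (('JJJ','KKK'), LLL). The second call shows a limitation: for the unbalanced input the stray ( is silently skipped and the result is AAA and BBB, so this pattern on its own does not detect malformed input.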

Crazy solution

A spartan regex that tokenizes and also validates all the tokens that it extracts:

\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()',\s]++\s*+(?(?!\)),)|\s*+'[^'\r\n]*+'\s*+(?(?!\)),))++\))|[^()',\s]++|'[^'\r\n]*+')\s*+(?:,|$)

Regex101

As a single-quoted PHP string literal, with / as the delimiter and the single quotes escaped:

'/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/'

ideone

The result is in capturing group 1. In the example on ideone, I specify the PREG_OFFSET_CAPTURE flag so that you can compare the end of the last match in group 0 (the entire match) with the length of the source string, and so check whether the entire source string has been consumed.
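
The check described above can be sketched as follows. This is not the ideone example itself, only one possible way to write it; the function name tokenize_validated is hypothetical, and the syntax assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the list of tokens (group 1), or null when
// the regex could not consume the whole input.
function tokenize_validated(string $input): ?array
{
    $pattern = '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/';

    $count = preg_match_all($pattern, $input, $matches, PREG_OFFSET_CAPTURE);
    if (!$count) {
        return null; // nothing matched at offset 0, or a PCRE error occurred
    }

    // With PREG_OFFSET_CAPTURE every entry is [matched text, byte offset].
    [$lastText, $lastOffset] = $matches[0][$count - 1];
    if ($lastOffset + strlen($lastText) !== strlen($input)) {
        return null; // matching stopped before the end of the string
    }

    return array_column($matches[1], 0); // keep only the text of group 1
}

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
print_r(tokenize_validated($str));         // the eight top-level tokens
var_dump(tokenize_validated("AAA, (BBB")); // NULL: the bracket is never closed
```

Because the pattern starts with \G, every match has to begin exactly where the previous one ended, so a gap can only appear after the last match. Comparing the end of that last match with strlen($input) is therefore enough to detect unconsumed input.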

Assumptions

  • Non-quoted text may not contain any whitespace character, as defined by \s. Consequently, it may not span multiple lines.
  • Non-quoted text may not contain (, ), ' or ,.
  • Non-quoted text must contain at least 1 character.
  • Single quoted text may not span multiple lines.
  • Single quoted text may not contain a single quote. Consequently, there is no way to include a literal ' inside it.
  • Single quoted text may be empty.
  • Bracket token contains one or more of the following as sub-tokens: non-quoted text token, single quoted text token, or another bracket token.
  • In bracket token, 2 adjacent sub-tokens are separated by exactly one ,
  • Bracket token starts with ( and ends with ).
  • Consequently, a bracket token must have balanced brackets, and empty bracket () is not allowed.
  • The input contains one or more tokens: non-quoted text, single quoted text or bracket tokens. Tokens are separated by a comma ,. A single trailing comma , is considered valid.
  • Whitespace characters (as defined by \s, which includes the new line character) are arbitrarily allowed between tokens, the commas , separating tokens, and the brackets (, ) of bracket tokens.

Breakdown

\G\s*+
(
  (
    \(
    (?:
        \s*+
        (?2)
        \s*+
        (?(?!\)),)
      |
        \s*+
        [^()',\s]++
        \s*+
        (?(?!\)),)
      |
        \s*+
        '[^'\r\n]*+'
        \s*+
        (?(?!\)),)
    )++
    \)
  )
  |
  [^()',\s]++
  |
  '[^'\r\n]*+'
)
\s*+(?:,|$)
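
The breakdown above can be used almost verbatim in PHP by adding the x modifier, which makes PCRE ignore the layout whitespace. The following is a sketch under that assumption; the function name is_well_formed and the sample inputs are hypothetical, and the comments inside the pattern are my reading of each branch.

```php
<?php
// Hypothetical validator built from the broken-down pattern plus the x modifier.
function is_well_formed(string $input): bool
{
    $pattern = <<<'REGEX'
/
\G\s*+
(                                # group 1: one top-level token
  (                              # group 2: a bracket token, reused through (?2)
    \(
    (?:
        \s*+ (?2)         \s*+ (?(?!\)),)   # nested bracket token
      | \s*+ [^()',\s]++  \s*+ (?(?!\)),)   # non-quoted text
      | \s*+ '[^'\r\n]*+' \s*+ (?(?!\)),)   # single quoted text
    )++
    \)
  )
  | [^()',\s]++
  | '[^'\r\n]*+'
)
\s*+(?:,|$)                      # a separating comma or the end of the input
/x
REGEX;

    $count = preg_match_all($pattern, $input, $m, PREG_OFFSET_CAPTURE);
    if (!$count) {
        return false;
    }
    // The input is accepted only if the last match ends at the end of the string.
    [$text, $offset] = $m[0][$count - 1];
    return $offset + strlen($text) === strlen($input);
}

var_dump(is_well_formed("AAA, (BBB, 'C,C')")); // bool(true)
var_dump(is_well_formed("AAA, ()"));           // bool(false): empty bracket
var_dump(is_well_formed("AAA, (BBB"));         // bool(false): unbalanced bracket
var_dump(is_well_formed("AAA BBB"));           // bool(false): tokens not separated by a comma
```

The conditional (?(?!\)),) reads as: if the next character is not ), a comma must follow. That is what forces a comma between adjacent sub-tokens inside a bracket token.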