How do I regex split by space, avoiding spaces within apostrophes?

左心房为你撑大大i submitted on 2020-03-16 08:45:10

Question


I want "git log --format='(%h) %s' --abbrev=7 HEAD" to be split into

[
  "git", 
  "log",
  "--format='(%h) %s'",
  "--abbrev=7",
  "HEAD"
]

How do I achieve this without splitting on the space within --format='(%h) %s'?

Answers in any language are welcome :)


Answer 1:


As often in life, you have choices.


  1. Use an expression that matches and captures different parts. This can be combined with a replacement function as in

    import re
    string = "git log --format='(%h) %s' --abbrev=7 HEAD"
    
    rx = re.compile(r"'[^']*'|(\s+)")
    
    def replacer(match):
        if match.group(1):
            return "#@#"
        else:
            return match.group(0)
    
    string = rx.sub(replacer, string)
    parts = re.split('#@#', string)
    #                 ^^^ same as in the function replacer
    
  2. You could use the third-party regex module, which supports the (*SKIP)(*FAIL) verbs:

    import regex as re
    string = "git log --format='(%h) %s' --abbrev=7 HEAD"
    
    rx = re.compile(r"'[^']*'(*SKIP)(*FAIL)|\s+")
    parts = rx.split(string)
    
  3. Write yourself a little parser:

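    def little_parser(string):
        quote = False   # True while inside a single-quoted section
        stack = ''      # characters collected for the current part

        for char in string:
            if char == "'":
                stack += char
                quote = not quote
            elif char == ' ' and not quote:
                if stack:   # skip empty parts caused by repeated spaces
                    yield stack
                stack = ''
            else:
                stack += char

        if stack:
            yield stack

    string = "git log --format='(%h) %s' --abbrev=7 HEAD"
    parts = list(little_parser(string))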
    



All three will yield
['git', 'log', "--format='(%h) %s'", '--abbrev=7', 'HEAD']
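
One caveat with option 1: the placeholder approach assumes that "#@#" never occurs in the input. A minimal sketch of a variant that uses the same pattern but cuts the string at the match positions instead, so no placeholder is needed (the function name split_outside_quotes is illustrative, not from the answer):

```python
import re

# Same pattern as in option 1: quoted runs are matched but not captured,
# whitespace outside quotes is captured in group 1.
rx = re.compile(r"'[^']*'|(\s+)")

def split_outside_quotes(text):
    parts, start = [], 0
    for match in rx.finditer(text):
        if match.group(1) is not None:   # whitespace outside quotes
            parts.append(text[start:match.start()])
            start = match.end()
    parts.append(text[start:])           # remainder after the last separator
    return parts

print(split_outside_quotes("git log --format='(%h) %s' --abbrev=7 HEAD"))
# ['git', 'log', "--format='(%h) %s'", '--abbrev=7', 'HEAD']
```

Like re.split, this returns an empty first or last element when the input starts or ends with whitespace.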



Answer 2:


As I understand, the idea is to split the string on contiguous spaces except where the spaces are part of a substring surrounded by single quotes. I believe this will work:

/(?:[^ ']*(?:'[^']+')?[^ ']*)*/

but invite readers to subject it to careful scrutiny.


This regex can be made self-documenting by writing it in free-spacing mode:

/
(?:         # begin a non-capture group
  [^ ']*    # match 0+ chars other than spaces and single quotes
  (?:       # begin non-capture group
    '[^']+' # match 1+ chars other than single quotes, surrounded
            # by single quotes 
  )?        # end non-capture group and make it optional
  [^ ']*    # match 0+ chars other than spaces and single quotes
)*          # end non-capture group and execute it 0+ times
/x          # free-spacing regex definition mode

This obviously will not work if there are nested single quotes.
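
Note that this regex matches the tokens themselves rather than the separators, so it is used with a "find all matches" call instead of a split. A rough sketch of how that might look in Python (the names TOKEN and tokens are made up for illustration); because the pattern can also match the empty string, for instance at each space, empty matches are filtered out:

```python
import re

# The pattern from this answer; each non-empty match is one token.
TOKEN = re.compile(r"(?:[^ ']*(?:'[^']+')?[^ ']*)*")

def tokens(text):
    # Drop the empty matches the pattern produces between tokens.
    return [t for t in TOKEN.findall(text) if t]

print(tokens("git log --format='(%h) %s' --abbrev=7 HEAD"))
# ['git', 'log', "--format='(%h) %s'", '--abbrev=7', 'HEAD']
```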

@n.'pronouns'm. suggested an alternative regex that also works:

/([^ "']|'[^'"]*')*/





Answer 3:


I found one possible (albeit ugly) solution in Python (which also works with double quotes):

>>> import re
>>> foo = '''git log --format='(%h) %s' --foo="a b" --bar='c d' HEAD'''
>>> re.findall(r'''(\S*'[^']+'\S*|\S*"[^"]+"\S*|\S+)''', foo)
['git', 'log', "--format='(%h) %s'", '--foo="a b"', "--bar='c d'", 'HEAD']



Source: https://stackoverflow.com/questions/60502562/how-do-i-regex-split-by-space-avoiding-spaces-within-apostrophes
