Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)

渐次进展 2020-12-10 07:00

Given a file of text, where the characters I want to match are delimited by single quotes, but might have zero or one escaped single quote, as well as zero or more tabs and n

3 Answers
  • 2020-12-10 07:13

    This should do it:

    menu_item = '((?:[^'\\]|\\')*)'
    

    Here the (?:[^'\\]|\\')* part matches any sequence made up of characters other than ' and \, or of the literal two-character sequence \'. The character class [^'\\] also matches line breaks and tabs, which you then need to replace with a single space.
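
    As a rough illustration of how this pattern could be used, here is a minimal sketch. The sample text and the helper name extract_items are made up for this example, and the whitespace replacement step is one possible way to do the cleanup mentioned above.

```python
import re

# Pattern from this answer: the group captures a run of either
# non-quote, non-backslash characters or the two-character sequence \'.
MENU_ITEM_RE = re.compile(r"menu_item = '((?:[^'\\]|\\')*)'")

# Hypothetical sample input (not taken from the question).
sample = "menu_item = 'Tony\\'s magic\n\tpizza';\nmenu_item = 'hamburger';"


def extract_items(text):
    """Return the captured items with whitespace collapsed and quotes unescaped."""
    items = []
    for raw in MENU_ITEM_RE.findall(text):
        item = re.sub(r"\s+", " ", raw)  # newlines and tabs become one space
        item = item.replace("\\'", "'")  # turn the escaped quote into a plain quote
        items.append(item)
    return items


print(extract_items(sample))
```

    Note that this pattern only accepts \' as an escape, so an item containing any other backslash sequence would not match.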

  • 2020-12-10 07:24

    This tested script should do the trick:

    import re
    re_sq_long = r"""
        # Match single quoted string with escaped stuff.
        '            # Opening literal quote
        (            # $1: Capture string contents
          [^'\\]*    # Zero or more non-', non-backslash
          (?:        # "unroll-the-loop"!
            \\.      # Allow escaped anything.
            [^'\\]*  # Zero or more non-', non-backslash
          )*         # Finish {(special normal*)*} construct.
        )            # End $1: String contents.
        '            # Closing literal quote
        """
    re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"
    
    data = r'''
            menu_item = 'casserole';
            menu_item = 'meat 
                        loaf';
            menu_item = 'Tony\'s magic pizza';
            menu_item = 'hamburger';
            menu_item = 'Dave\'s famous pizza';
            menu_item = 'Dave\'s lesser-known
                gyro';'''
    matches = re.findall(re_sq_long, data, re.DOTALL | re.VERBOSE)
    menu_items = []
    for match in matches:
        match = re.sub(r'\s+', ' ', match)      # Collapse whitespace runs to one space
        match = re.sub(r'\\(.)', r'\1', match)  # Drop only the escaping backslash
        menu_items.append(match)          # Add to menu list
    
    print(menu_items)
    

    Here is the short version of the regex:

    '([^'\\]*(?:\\.[^'\\]*)*)'

    This regex is optimized using Jeffrey Friedl's "unrolling-the-loop" efficiency technique. See Mastering Regular Expressions (3rd Edition) for details.

    Note that the above regex is equivalent to the following one, which is more commonly seen but is typically slower on backtracking (NFA) regex implementations:

    '((?:[^'\\]|\\.)*)'
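
    To sanity-check the equivalence claim, a small sketch like the following compares the captures of both forms on the same input. The sample text is a shortened variant of the data above, and the names UNROLLED and ALTERNATION are chosen only for this example. It checks that the results agree and does not measure speed.

```python
import re

# The two forms from this answer, compiled with DOTALL so that
# an escaped newline (backslash followed by a line break) is allowed.
UNROLLED = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)
ALTERNATION = re.compile(r"'((?:[^'\\]|\\.)*)'", re.DOTALL)

# Shortened sample in the style of the data used above.
sample = r"""
menu_item = 'casserole';
menu_item = 'Tony\'s magic pizza';
menu_item = 'meat
    loaf';
"""

unrolled_hits = UNROLLED.findall(sample)
alternation_hits = ALTERNATION.findall(sample)

print(unrolled_hits == alternation_hits)  # both forms capture the same strings
print(unrolled_hits)
```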

  • 2020-12-10 07:24

    You could try it like this:

    pattern = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)
    

    The match starts at the single quote that follows menu_item = and ends at the first single quote not preceded by a backslash. Because of re.DOTALL, the captured group also includes any newlines and tabs found between the two single quotes.
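
    A minimal sketch of this pattern in use follows. The sample strings are invented for illustration. The second one assumes the data could contain an escaped backslash right before the closing quote, which the question does not mention. In that case the lookbehind mistakes the real closing quote for an escaped one and the match runs on.

```python
import re

# Pattern from this answer: lazy match up to the first quote
# that is not directly preceded by a backslash.
LOOKBEHIND = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)

# Hypothetical input with an escaped quote and an embedded line break.
sample = r"""
menu_item = 'Dave\'s famous
    pizza';
menu_item = 'hamburger';
"""
items = LOOKBEHIND.findall(sample)
print(items)

# Assumed edge case: the value ends in an escaped backslash (\\).
# The quote after it really closes the string, but the lookbehind rejects it.
tricky = r"menu_item = 'C:\\'; menu_item = 'x';"
overshoot = LOOKBEHIND.search(tricky).group(1)
print(overshoot)  # captures past the intended closing quote
```

    If such input is possible, a pattern that consumes escape pairs explicitly, such as the \\. forms in the other answer, avoids the problem.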
