Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)

渐次进展 2020-12-10 07:00

Given a file of text, where the characters I want to match are delimited by single quotes, but might have zero or one escaped single quote, as well as zero or more tabs and n

3 answers
  •  死守一世寂寞
    2020-12-10 07:24

    This tested script should do the trick:

    import re
    re_sq_long = r"""
        # Match single quoted string with escaped stuff.
        '            # Opening literal quote
        (            # $1: Capture string contents
          [^'\\]*    # Zero or more non-', non-backslash
          (?:        # "unroll-the-loop"!
            \\.      # Allow escaped anything.
            [^'\\]*  # Zero or more non-', non-backslash
          )*         # Finish {(special normal*)*} construct.
        )            # End $1: String contents.
        '            # Closing literal quote
        """
    re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"
    
    data = r'''
            menu_item = 'casserole';
            menu_item = 'meat 
                        loaf';
            menu_item = 'Tony\'s magic pizza';
            menu_item = 'hamburger';
            menu_item = 'Dave\'s famous pizza';
            menu_item = 'Dave\'s lesser-known
                gyro';'''
    matches = re.findall(re_sq_long, data, re.DOTALL | re.VERBOSE)
    menu_items = []
    for match in matches:
        match = re.sub(r'\s+', ' ', match)      # Collapse whitespace runs (raw string avoids an invalid-escape warning)
        match = re.sub(r'\\(.)', r'\1', match)  # Unescape: drop the backslash, keep the escaped char
        menu_items.append(match)                # Add to menu list
    
    print(menu_items)
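
    The clean-up steps in the script can also be wrapped in a small reusable helper. The following is only a sketch: the names SINGLE_QUOTED and extract_single_quoted, and the sample text, are made up for illustration and are not part of the script above. It assumes that an escaped backslash (\\) should survive as a single backslash rather than disappear.

```python
import re

# Short form of the pattern, compiled once. DOTALL lets "\\." also
# match a backslash that is followed by a newline.
SINGLE_QUOTED = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)


def extract_single_quoted(text):
    """Return the cleaned contents of every single-quoted string in text."""
    items = []
    for raw in SINGLE_QUOTED.findall(text):
        # Drop only the escaping backslash and keep the character it escaped.
        unescaped = re.sub(r"\\(.)", r"\1", raw, flags=re.DOTALL)
        # Collapse tabs, newlines, and runs of spaces into a single space.
        items.append(re.sub(r"\s+", " ", unescaped))
    return items


# Hypothetical sample input, not taken from the question.
sample = r"""
menu_item = 'Tony\'s magic pizza';
menu_item = 'meat
            loaf';
path = 'C:\\temp';
"""
print(extract_single_quoted(sample))
```

    Compiling the pattern once avoids passing the flags on every call, and keeping the unescape step separate from the whitespace step makes each one easy to change on its own.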
    

    Here is the short version of the regex:

    '([^'\\]*(?:\\.[^'\\]*)*)'

    This regex is optimized using Jeffrey Friedl's "unrolling-the-loop" efficiency technique. See Mastering Regular Expressions (3rd Edition) for details.

    Note that the above regex is equivalent to the following one, which is more commonly seen but tends to be slower on backtracking (NFA) regex engines because every character has to pass through the alternation:

    '((?:[^'\\]|\\.)*)'
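
    As a quick sanity check, the two forms can be run side by side on a few inputs. This is a minimal sketch with made-up sample strings; it only confirms that both patterns return the same matches on these inputs and does not measure speed.

```python
import re

# Unrolled form: normal* (special normal*)*
unrolled = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)
# Alternation form: (normal | special)*
alternation = re.compile(r"'((?:[^'\\]|\\.)*)'", re.DOTALL)

# Hypothetical test inputs chosen to cover escapes, whitespace,
# empty strings, and an unterminated string.
samples = [
    r"menu_item = 'casserole';",
    r"menu_item = 'Tony\'s magic pizza';",
    "menu_item = 'meat\n\tloaf';",
    r"a = ''; b = 'x\\'; c = 'y';",
    r"unterminated = 'oops\'",
]

for text in samples:
    a = unrolled.findall(text)
    b = alternation.findall(text)
    # Both patterns should capture exactly the same string contents.
    assert a == b, (text, a, b)
    print(a)
```

    To compare performance, the same two compiled patterns could be passed to the standard timeit module with a long input; results will depend on the Python version and the data.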
