Removing hash comments that are not inside quotes

Question

I am using Python to go through a file and remove any comments. A comment is defined as a hash and everything to the right of it, as long as the hash isn't inside double quotes. I currently have a solution, but it seems sub-optimal:

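import re

filelines = []
r = re.compile('(".*?")')  # split each line into quoted and unquoted tokens
for line in f:
    m = r.split(line)
    nline = ''
    for token in m:
        # An unquoted token containing a hash: keep the text before it and stop
        if token.find('#') != -1 and token[0] != '"':
            nline += token[:token.find('#')]
            break
        else:
            nline += token
    filelines.append(nline)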

Is there a way to find the first hash not within quotes without for loops (i.e., with regular expressions)?

Examples:

' "Phone #":"555-1234" ' -> ' "Phone #":"555-1234" '
' "Phone "#:"555-1234" ' -> ' "Phone "'
'#"Phone #":"555-1234" ' -> ''
' "Phone #":"555-1234" #Comment' -> ' "Phone #":"555-1234" '

Edit: Here is a pure regex solution created by user2357112. I tested it, and it works great:

filelines = []
r = re.compile('(?:"[^"]*"|[^"#])*(#)')
for line in f:
    m = r.match(line)
    if m is not None:
        filelines.append(line[:m.start(1)])
    else:
        filelines.append(line)

See his reply for more details on how this regex works.

Edit2: Here's a version of user2357112's code that I modified to account for escape characters (\"). This code also eliminates the 'if' by including a check for end of string ($):

filelines = []
r = re.compile(r'(?:"(?:[^"\\]|\\.)*"|[^"#])*(#|$)')
for line in f:
    m = r.match(line)
    filelines.append(line[:m.start(1)])
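
As a quick check, the Edit2 pattern can be wrapped in a helper and run against the examples from the question. This is only a sketch: the name strip_comment is made up here, and the fallback for a line with an unterminated quote (where the pattern cannot match at all, so m would be None) is an assumption about the desired behavior, not part of the original code.

```python
import re

# Escape-aware pattern from Edit2: skip quoted strings (allowing \" inside)
# and any other non-quote, non-hash characters, then stop at '#' or end of string.
COMMENT_RE = re.compile(r'(?:"(?:[^"\\]|\\.)*"|[^"#])*(#|$)')

def strip_comment(line):
    """Return line with an unquoted '#' comment removed (hypothetical helper)."""
    m = COMMENT_RE.match(line)
    if m is None:
        # Assumption: a line with an unterminated quote is left untouched.
        return line
    return line[:m.start(1)]

# The four examples from the question, as (input, expected) pairs.
examples = [
    (' "Phone #":"555-1234" ', ' "Phone #":"555-1234" '),
    (' "Phone "#:"555-1234" ', ' "Phone "'),
    ('#"Phone #":"555-1234" ', ''),
    (' "Phone #":"555-1234" #Comment', ' "Phone #":"555-1234" '),
]
for before, after in examples:
    assert strip_comment(before) == after
```

Note that when a comment is found, everything to its right is dropped, including a trailing newline if the line still has one.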

Answer 1:

r'''(?:        # Non-capturing group
      "[^"]*"  # A quote, followed by not-quotes, followed by a quote
      |        # or
      [^"#]    # not a quote or a hash
    )          # end group
    *          # Match quoted strings and not-quote-not-hash characters until...
    (\#)       # the comment begins! (escaped, since a bare # starts a comment under re.VERBOSE)
'''

This is a verbose regex designed to operate on a single line, so make sure to use the re.VERBOSE flag and feed it one line at a time. If there is an unquoted hash, the first one is captured as group 1, so you can use match.start(1) to get its index. It doesn't handle backslash escapes, so a backslash-escaped quote inside a string will be treated as the end of that string. This is untested.
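
A minimal sketch of how a pattern like this might be compiled and used. One caveat: under re.VERBOSE, a # outside a character class starts a regex comment, so the sketch writes the captured hash as \#. The helper name find_comment_start is illustrative only.

```python
import re

# With re.VERBOSE, '#' outside a character class begins a regex comment,
# so the literal hash in the capture group is escaped as '\#'.
UNQUOTED_HASH = re.compile(r'''
    (?:            # non-capturing group
        "[^"]*"    # a quoted string
      |            # or
        [^"#]      # a character that is neither a quote nor a hash
    )*             # repeated until...
    (\#)           # ...the first unquoted hash, captured as group 1
''', re.VERBOSE)

def find_comment_start(line):
    """Return the index of the first unquoted '#', or None if there is none."""
    m = UNQUOTED_HASH.match(line)
    return m.start(1) if m else None

line = ' "Phone #":"555-1234" #Comment'
idx = find_comment_start(line)
print(line if idx is None else line[:idx])
```

Returning the index instead of the stripped line keeps the caller free to decide what to do with the comment text itself.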

Answer 2:

You can remove comments using this script:

import re
print(re.sub(r'("(?:[^"]+|(?<=\\)")*")|#[^\n]*', lambda m: m.group(1) or '', '"Phone #"#:"555-1234"'))

The idea is to capture each double-quoted part and replace it with itself, so that only a hash outside quotes gets matched as a comment:

(                 # open the capture group 1
    "             # " 
    (?:           # open a non-capturing group
        [^"]+     # all characters except "
      |           # OR
        (?<=\\)"  # escaped quote
    )*            # repeat zero or more times
    "             # "
)                 # close the capture group 1

|                 # OR

#[^\n]*           # a hash and zero or more characters that are not a newline
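
The same substitution idea can be applied to a whole multi-line string at once. The sketch below is an assumption-laden variant, not the answer's exact pattern: it matches quoted strings with the backslash-aware form from the question's Edit2 instead of the lookbehind, and the function name strip_hash_comments is hypothetical.

```python
import re

# Group 1 captures a quoted string (with backslash escapes); the second
# alternative matches an unquoted '#' through the end of the line.
TOKEN_RE = re.compile(r'("(?:[^"\\]|\\.)*")|#[^\n]*')

def strip_hash_comments(text):
    """Keep quoted strings as they are and drop unquoted '#' comments."""
    # group(1) is None when the comment alternative matched, so it becomes ''.
    return TOKEN_RE.sub(lambda m: m.group(1) or '', text)

sample = 'name = "Phone #"  # label\nvalue = "555-1234"\n# whole-line comment\n'
print(strip_hash_comments(sample))
```

Because the comment alternative stops before the newline, line breaks survive and a whole-line comment leaves an empty line behind.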

Answer 3:

This code was so ugly, I had to post it.

def remove_comments(text):
    char_list = list(text)
    in_str = False
    deleting = False
    for i, c in enumerate(char_list):
        if deleting:
            if c == '\n':
                deleting = False
            else:
                char_list[i] = None
        elif c == '"':
            in_str = not in_str
        elif c == '#':
            if not in_str:
                deleting = True
                char_list[i] = None
    return ''.join(c for c in char_list if c is not None)

It seems to work, though I'm not sure how it handles the different newline conventions on Windows and Linux.
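
On the newline question: tracing the loop above, a '\r' that precedes '\n' falls inside the deleted span, so a comment line ending in '\r\n' would come out ending in just '\n' (this only matters if the text was read without newline translation). Below is a minimal sketch of one way around that, assuming the goal is to preserve whatever line ending the input had. The function name is made up for illustration, and it builds an output list instead of blanking entries in place.

```python
def remove_comments_keep_eol(text):
    """Drop unquoted '#' comments, keeping '\\r' and '\\n' exactly as given."""
    out = []
    in_str = False      # inside a double-quoted string
    deleting = False    # inside a comment, skipping characters
    for c in text:
        if deleting:
            # Either line-ending character ends the comment and is kept.
            if c in '\r\n':
                deleting = False
                out.append(c)
        elif c == '"':
            in_str = not in_str
            out.append(c)
        elif c == '#' and not in_str:
            deleting = True
        else:
            out.append(c)
    return ''.join(out)

print(repr(remove_comments_keep_eol('a = 1 # c\r\nb = "#" # d\n')))
```

As in the original function, the quote state carries across lines and backslash escapes are not handled.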

Source: https://stackoverflow.com/questions/17791143/removing-hash-comments-that-are-not-inside-quotes

Tags

python

regex

comments

quotes

strip