Errors when trying to remove parentheses in python text

旧街凉风 submitted on 2019-12-13 13:43:06

Question


I've been working on a bit of code to take a bunch of histograms from other files and plot them together. In order to make sure the legend displays correctly I've been trying to take the titles of these original histograms and cut out a bit of information that isn't needed any more.

The section I don't need takes the form (A mass=200 GeV). I've had no problem removing what's inside the parentheses; unfortunately, everything I've tried for the parentheses themselves either has no effect, negates the code that removes the text, or throws errors.

I've tried using suggestions from "Remove parenthesis and text in a file using Python" and "How can I remove text within parentheses with a regex?".

The error my current attempt gives me is

'str' object cannot be interpreted as an integer

This is the section of the code:

histo_name = ''

# this is a list of things we do not want to show up in our legend keys
REMOVE_LIST = ["(A mass = 200 GeV)"]

# these two lines use the re module to remove things from a piece of text
# that are specified in the remove list
remove = '|'.join(REMOVE_LIST)
regex = re.compile(r'\b('+remove+r')\b')

# Creating the correct name for the stacked histogram
for histo in histos:

    if histo == histos[0]:

        # place_holder contains the edited string we want to set the
        # histogram title to
        place_holder = regex.sub('', str(histo.GetName()))
        histo_name += str(place_holder)
        histo.SetTitle(histo_name)

    else:

        place_holder = regex.sub(r'\(\w*\)', '', str(histo.GetName()))
        histo_name += ' + ' + str(place_holder)
        histo.SetTitle(histo_name)

The if/else bit is there because the first histogram I pass in isn't stacked, so I want it to keep its own name, while the rest are stacked in order (hence the '+' etc.). It probably isn't relevant, but I thought I'd include it.

Apologies if I've done something really obvious wrong, I'm still learning!


Answer 1:


From the python docs - To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].

So use one of the above patterns instead of the plain brackets in your regex, e.g. REMOVE_LIST = [r"\(A mass = 200 GeV\)"]
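
As a quick check of the two spellings quoted from the docs, here is a minimal sketch. The sample title string is made up for illustration and does not come from the question's histograms.

```python
import re

# Hypothetical histogram title, used only for illustration
title = "ttbar (A mass = 200 GeV)"

# Backslash-escaped parentheses; the raw string keeps the backslashes intact
escaped = re.compile(r"\(A mass = 200 GeV\)")

# The same literal parentheses written as single-character classes
char_class = re.compile(r"[(]A mass = 200 GeV[)]")

result_escaped = escaped.sub("", title)
result_char_class = char_class.sub("", title)
print(repr(result_escaped))
print(repr(result_char_class))
```

Both patterns remove the parenthesised fragment and leave the trailing space behind, so the result may still need a strip().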

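EDIT: The issue seems to be with your use of \b in the regex. According to the docs linked above, \b matches only at the boundary between a word character (\w) and a non-word character. Parentheses are non-word characters, so \b\( cannot match when the parenthesis is preceded by a space, and \)\b cannot match when it is followed by one. My seemingly working example is: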

import re

# Test input
myTestString = "someMess (A mass = 200 GeV) and other mess (remove me if you can)"
replaceWith = "HEY THERE FRIEND"

# What to remove
removeList = [r"\(A mass = 200 GeV\)", r"\(remove me if you can\)"]

# Build the regex
remove = r'(' + '|'.join(removeList) + r')'
regex = re.compile(remove)

# Try it!
out = regex.sub(replaceWith, myTestString)

# See if it worked
print(out)
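
The effect of \b around the parentheses can be checked in isolation. This is a small sketch with made-up input strings, not the original histogram titles:

```python
import re

text = "someMess (A mass = 200 GeV) and other mess"
target = r"\(A mass = 200 GeV\)"

# With \b on both sides nothing matches: "(" follows a space and ")" precedes
# a space, so neither position is a word boundary.
with_boundaries = re.findall(r"\b" + target + r"\b", text)

# Without \b the escaped literal matches as expected.
without_boundaries = re.findall(target, text)

# \b does succeed when word characters sit directly against the parentheses.
glued = re.findall(r"\b" + target + r"\b", "x(A mass = 200 GeV)y")

print(with_boundaries, without_boundaries, glued)
```

So \b is not wrong in general; it just requires a word character directly next to the parenthesis, which legend titles like these do not have.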



Answer 2:


There are 2 problems you are facing

  1. You join your strings into a regex pattern without escaping
  2. You are using word boundaries, but some of your entries start/end with a non-word character (thus r"\)\b" will not match a ) that is followed by a space or the end of the string).

This fixes the first issue, but not the second (it finds More+[fun]+text only):

import re

REMOVE_LIST = ["(A mass = 200 GeV)", "More+[fun]+text"]
# Escape each entry so regex metacharacters are matched literally
remove = '|'.join([re.escape(x) for x in REMOVE_LIST])
ptrn = r'\b(?:'+remove+r')\b'
print(ptrn)
regex = re.compile(ptrn)
print(regex.findall("Now, (A mass = 200 GeV) and More+[fun]+text inside"))

You'd need a smarter way to create your pattern. Like this:

import re
REMOVE_LIST = ["(A mass = 200 GeV)", "More+[fun]+text"]

remove_with_boundaries = '|'.join([re.escape(x) for x in REMOVE_LIST if re.match(r'\w', x) and re.search(r'\w$', x)])
remove_with_no_boundaries = '|'.join([re.escape(x) for x in REMOVE_LIST if not re.match(r'\w', x) and not re.search(r'\w$', x)])
remove_with_right_boundaries = '|'.join([re.escape(x) for x in REMOVE_LIST if not re.match(r'\w', x) and re.search(r'\w$', x)])
remove_with_left_boundaries = '|'.join([re.escape(x) for x in REMOVE_LIST if re.match(r'\w', x) and not re.search(r'\w$', x)])

# Collect the non-empty groups and join them with '|', so the pattern never
# starts with a stray '|' (an empty alternative would match everywhere)
parts = []
if remove_with_boundaries:
    parts.append(r'\b(?:' + remove_with_boundaries + r')\b')
if remove_with_left_boundaries:
    parts.append(r'\b(?:' + remove_with_left_boundaries + r')')
if remove_with_right_boundaries:
    parts.append(r'(?:' + remove_with_right_boundaries + r')\b')
if remove_with_no_boundaries:
    parts.append(r'(?:' + remove_with_no_boundaries + r')')
ptrn = '|'.join(parts)

print(ptrn)
regex = re.compile(ptrn)
print(regex.findall("Now, (A mass = 200 GeV) and More+[fun]+text inside"))

See IDEONE demo

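For the two entries ["(A mass = 200 GeV)", "More+[fun]+text"] as input, the generated regex is \b(?:More\+\[fun\]\+text)\b|(?:\(A\ mass\ \=\ 200\ GeV\)) and the output is ['(A mass = 200 GeV)', 'More+[fun]+text']. (On Python 3.7 and later, re.escape leaves = unescaped, so the pattern contains = rather than \=; the matches are the same.)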
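
Applied back to the question, the pieces might fit together as in the sketch below. It makes several assumptions: the names list stands in for the values returned by histo.GetName(), clean_title is a hypothetical helper, and the title is built in one step rather than cumulatively as in the original loop. It also reproduces the reported error: a compiled pattern's sub() takes (repl, string, count), so in regex.sub(r'\(\w*\)', '', name) the name lands in the count slot, which is what raises "'str' object cannot be interpreted as an integer".

```python
import re

REMOVE_LIST = ["(A mass = 200 GeV)"]

# Escape every entry so "(" and ")" are literal; no \b, because the entries
# start and end with non-word characters.
regex = re.compile("|".join(re.escape(item) for item in REMOVE_LIST))


def clean_title(name):
    """Drop the unwanted fragments and tidy the leftover whitespace."""
    return " ".join(regex.sub("", name).split())


# Hypothetical names standing in for histo.GetName()
names = ["signal (A mass = 200 GeV)", "ttbar (A mass = 200 GeV)", "WW"]
stacked_title = " + ".join(clean_title(n) for n in names)
print(stacked_title)

# Reproduce the error from the question: the third positional argument of a
# compiled pattern's sub() is `count`, so a string there raises TypeError.
try:
    regex.sub(r"\(\w*\)", "", names[0])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

Calling regex.sub('', name) with two arguments, as the first branch of the question's loop already does, avoids the TypeError.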



Source: https://stackoverflow.com/questions/31476415/errors-when-trying-to-remove-parentheses-in-python-text
