How to check that a regular expression has matched a string completely, i.e. - the string did not contain any extra character?

问题

I have two questions:

1) I have a regular expression ([A-Z][a-z]{0,2})(\d*) and I am using Python's re.finditer() to match appropriate strings. My problem is, that I want to match only strings that contain no extra characters, otherwise I want to raise an exception.

I want to catch a following pattern: - capital letter, followed by 0, 1 or 2 small letters, followed by 0 or more numbers.

The pattern represents a chemical formula, i.e. atom followed by number of it's occurences. I want to put the atom into a dictionary with it's number of occurences, so I need to separate atoms (capital letter followed by 0, 1 or 2 small letters) and numbers, but remember that they belong together.

Example:

C6H5Fe2I   # this string should be matched successfully. Result: C6 H5 Fe2 I
H2TeO4     # this string should be matched successfully Result: H2 Te O4
H3PoooO5   # exception should be raised
C2tH6      # exception should be raised

2) second question is what kind of Exception should I raise in case the input string is wrong.

Thank you, Tomas

回答1:

Here's a few different approaches you could use:

Compare lengths

Find the length of the original string.
Sum the length of the matched strings.
If the two numbers differ there were unused characters.

Note that you can also combine this method with your existing code rather than doing it as an extra step if you want to avoid parsing the string twice.

Regular expression for entire string

You can check if this regular expression matches the entire string:

^([A-Z][a-z]{0,2}\d*)*$

(Rubular)

Tokenize

You can use the following regular expression to tokenize the original string:

[A-Z][^A-Z]*

Then check each token to see if it matches your original regular expression.

回答2:

capital letter, followed by 0, 1 or 2 small letters, followed by 0 or more numbers

Ok then.

/^([A-Z][a-z]{0,2}\d*)+$/

Difference here being the extra grouping (foo)+ within the ^$ allowing you to capture pattern foo N times.

No global flag? Guess you'll have to split the result of that regex on the pattern again then.

回答3:

Do you need to extract each individual part to process, or simply match for input validation? If you just need to match for validation, try ^([A-Z][a-z]{0,2}\d*)+$.

回答4:

You need slightly different regex:

^([A-Z][a-z]{0,2})(\d*)$

which won't match any of your example strings, however. You need to provide better description of why those strings supposed to match.

Just to test whether the whole string match you could use:

>>> re.match(r'(([A-Z][a-z]{,2})(\d*))+$', 'H2TeO4')
<_sre.SRE_Match object at 0x920f520>
>>> re.match(r'(([A-Z][a-z]{,2})(\d*))+$', 'H3PoooO5')
>>>

I didn't find pure regex solution, but here is how to test and collect matches:

>>> res = re.findall(r'([A-Z][a-z]{,2})(\d*)(?=(?:[A-Z][a-z]{,2}\d*|$))', s)
>>> res
[('C', '6'), ('H', '5'), ('Fe', '2'), ('I', '')]
>>> ''.join(''.join(i) for i in res) == s
True

回答5:

>>> import re
>>> reMatch = re.compile( '([A-Z][a-z]{0,2})(\d*)' )
>>> def matchText ( text ):
        matches, i = [], 0
        for m in reMatch.finditer( text ):
            if m.start() > i:
                break
            matches.append( m )
            i = m.end()
        else:
            if i == len( text ):
                return matches
        raise ValueError( 'invalid text' )

>>> matchText( 'C6H5Fe2I' )
[<_sre.SRE_Match object at 0x021E2800>, <_sre.SRE_Match object at 0x021E28D8>, <_sre.SRE_Match object at 0x021E2920>, <_sre.SRE_Match object at 0x021E2968>]
>>> matchText( 'H2TeO4' )
[<_sre.SRE_Match object at 0x021E2890>, <_sre.SRE_Match object at 0x021E29F8>, <_sre.SRE_Match object at 0x021E2A40>]
>>> matchText( 'H3PoooO5' )
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    matchText( 'H3PoooO5' )
  File "<pyshell#3>", line 11, in matchText
    raise ValueError( 'invalid text' )
ValueError: invalid text
>>> matchText( 'C2tH6' )
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    matchText( 'C2tH6' )
  File "<pyshell#3>", line 11, in matchText
    raise ValueError( 'invalid text' )
ValueError: invalid text

To answer your second question a bit more clearly than with the code above: A ValueError is used in cases where a parameter was of the correct type but the value was not right. So for a function that uses a regex, it is obviously the best you can choose.

回答6:

Use this pattern

(([A-Z][a-z]{0,2})(\d*))+

If it matches, great! If not, then handle it. I see no reason to raise an exception if it doesn't match. You'll have to provide more info.

回答7:

My go without regexp:

tests= (
'C6H5Fe2I',   # this string should be matched successfully. Result: C6 H5 Fe2 I
'H2TeO4',     # this string should be matched successfully Result: H2 Te O4
'H3PoooO5',   # exception should be raised
'C2tH6')      # exception should be raised

def splitter(case):
    case, original = list(case), case
    while case:
        if case[0].isupper():
            result = case.pop(0)
        else:
            raise ValueError('%r is not capital letter in %s position %i.' %
                             (case[0], original, len(original)-len(case)))
        for count in range(2):
            if case and case[0].islower():
                result += case.pop(0)
            else:
                break
        for count in range(2):
            if case and case[0].isdigit():
                result += case.pop(0)
            else:
                break
        yield result

for testcase in tests:
    try:
        print tuple(splitter(testcase))
    except ValueError as e:
        print(e)

回答8:

You can do this in not much code with re.split -- yes, that's correct, re.split.

Here are the docs:

Invert your problem: split your input with a delimiter pattern that matches a valid atom+count. Have a capturing group so that the delimiter strings are kept. If the input string is valid, the non-delimiters in the result will all be empty strings.

>>> tests= (
... 'C6H5Fe2I',
... 'H2TeO4',
... 'H3PoooO5',
... 'C2tH6',
... 'Bad\n')
>>> import re
>>> pattern = r'([A-Z][a-z]{0,2}\d*)'
>>> for test in tests:
...     pieces = re.split(pattern, test)
...     print "\ntest=%r pieces=%r" % (test, pieces)
...     data = pieces[1::2]
...     rubbish = filter(None, pieces[0::2])
...     print "rubbish=%r data=%r" % (rubbish, data)
...

test='C6H5Fe2I' pieces=['', 'C6', '', 'H5', '', 'Fe2', '', 'I', '']
rubbish=[] data=['C6', 'H5', 'Fe2', 'I']

test='H2TeO4' pieces=['', 'H2', '', 'Te', '', 'O4', '']
rubbish=[] data=['H2', 'Te', 'O4']

test='H3PoooO5' pieces=['', 'H3', '', 'Poo', 'o', 'O5', '']
rubbish=['o'] data=['H3', 'Poo', 'O5']

test='C2tH6' pieces=['', 'C2', 't', 'H6', '']
rubbish=['t'] data=['C2', 'H6']

test='Bad\n' pieces=['', 'Bad', '\n']
rubbish=['\n'] data=['Bad']
>>>

回答9:

for the Validation please to try if you are using .NET Framework

([A-Z][a-b]??[0-9]??)*

other wise

([A-Z][a-b]?[a-b]?[0-9]?[0-9]?)*

来源：https://stackoverflow.com/questions/3879534/how-to-check-that-a-regular-expression-has-matched-a-string-completely-i-e-t

标签

python

regex

exception-handling

error-handling