Question
I am trying to extract the German Handelsregisternummer (company register number), which is usually written directly after the word HRB. However, there are exceptions that I would like to catch with my regex. The goal is to call the function with a keyword (in this case HRB) and have it return the number.
This is what I have so far. It doesn't catch all cases.
    import re

    def get_company_register_number(string, keyword):
        # Keyword, optional separators, then the next "word"
        reg_1 = fr'\b{keyword}\b[,:|\s]*(\w+)'
        match = re.compile(reg_1)
        company_register_number = match.findall(string)  # list of matched words
        if company_register_number:  # not empty
            return company_register_number
        else:  # no match found
            company_register_number = []
            return company_register_number

    string = "HRB: 21156"
    get_company_register_number(string, 'HRB')
    # returns ['21156']
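To make the failure concrete, here is a small sketch that runs the same pattern on a few variants. The function name original_attempt and the choice of sample strings are illustrative assumptions, not part of the original post.

```python
import re

def original_attempt(string, keyword):
    # Same pattern as in the question: keyword, separators, then the next "word".
    return re.findall(fr'\b{keyword}\b[,:|\s]*(\w+)', string)

samples = ['HRB: 21156', 'HRB Nr. 21156', 'HRB Nummer 21156', 'HRB-Nr. 12345']
for s in samples:
    print(s, '=>', original_attempt(s, 'HRB'))
```

Because (\w+) captures whatever word comes next, "Nr" or "Nummer" is returned instead of the number, and a hyphen right after the keyword prevents any match at all.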
Answer 1:
You could extend the character class and move the word boundary to just before the digits:
    \bHRB[.,: \w-]*\b(\d+)
Or a slightly more precise match:
    \bHRB[,:]?(?:[- ](?:Nr|Nummer)[.:]*)? (\d+)
The second pattern breaks down as follows:
\bHRB - a word boundary, then match HRB
[,:]? - optionally match , or :
(?: - start of a non-capturing group
[- ](?:Nr|Nummer)[.:]* - match a space or -, then Nr or Nummer, then 0 or more times a . or :
)? - close the group and make it optional
a space followed by (\d+) - match a literal space, then capture 1 or more digits in group 1
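As a rough illustration of how the two patterns above differ, the following sketch applies both to a few strings. The constant names, the helper extract, and the string 'HRB foo bar 123' are assumptions made up for this example.

```python
import re

# Both patterns are taken from this answer; the names are hypothetical.
PATTERN_BROAD = re.compile(r'\bHRB[.,: \w-]*\b(\d+)')
PATTERN_PRECISE = re.compile(r'\bHRB[,:]?(?:[- ](?:Nr|Nummer)[.:]*)? (\d+)')

def extract(pattern, text):
    # Return every captured number as a list of strings.
    return pattern.findall(text)

for s in ['HRB: 21156', 'HRB-Nr. 12345', 'HRB Nr.: 21156', 'HRB foo bar 123']:
    print(s, '=>', extract(PATTERN_BROAD, s), extract(PATTERN_PRECISE, s))
```

The broad pattern accepts any run of word characters, spaces and punctuation between HRB and the number, so it also picks up 123 in the last string. The precise pattern only allows the Nr/Nummer forms and returns nothing there.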
Answer 2:
You may use
    \bHRB\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)
Details
\bHRB\b - the whole word HRB
(?:[-\s]N(?:umme)?r)? - an optional group matching - or whitespace and then Nr or Nummer
[,.:\s]* - 0 or more commas, dots, colons or whitespace characters
(\d+) - Group 1: one or more digits.
See a Python demo:
    import re

    strings = ['HRB 21156', 'HRB, 1234', 'HRB: 99887', 'HRB-Nummer 21156',
               'HRB-Nr. 12345', 'HRB-Nr: 21156', 'HRB Nr. 21156', 'HRB Nr: 21156',
               'HRB Nr.: 21156', 'HRB Nummer 21156', 'no number here']

    def get_company_register_number(string, keyword):
        return re.findall(fr'\b{keyword}\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)', string)

    for s in strings:
        print(s, '=>', get_company_register_number(s, 'HRB'))
Output:
HRB 21156 => ['21156']
HRB, 1234 => ['1234']
HRB: 99887 => ['99887']
HRB-Nummer 21156 => ['21156']
HRB-Nr. 12345 => ['12345']
HRB-Nr: 21156 => ['21156']
HRB Nr. 21156 => ['21156']
HRB Nr: 21156 => ['21156']
HRB Nr.: 21156 => ['21156']
HRB Nummer 21156 => ['21156']
no number here => []
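If the keyword comes from user input, or only the first number is wanted, a variant along these lines may be useful. It is a sketch, not part of the original answer: the function name get_first_register_number is hypothetical, and the use of re.escape and re.search is an added assumption about what callers need.

```python
import re

def get_first_register_number(text, keyword):
    """Return the first number after `keyword` as a string, or None."""
    # re.escape keeps regex metacharacters in the keyword from being interpreted.
    pattern = rf'\b{re.escape(keyword)}\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)'
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(get_first_register_number('Amtsgericht München, HRB Nr.: 21156', 'HRB'))
print(get_first_register_number('no number here', 'HRB'))
```

Returning None instead of an empty list lets the caller tell "no match" apart with a simple `is None` check. Whether that is preferable to the list returned by re.findall depends on how the result is used.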
Source: https://stackoverflow.com/questions/63886756/extract-companies-register-number-in-python-by-getting-the-next-word