Extract companies' register number in Python by getting the next word

Submitted by 我是研究僧i on 2021-01-29 06:54:40

Question


I am trying to extract the German Handelsregisternummer (companies' register number), which is usually written directly after the word HRB. However, there are exceptions that I would like to catch with my regex. The goal is a function that takes a keyword (in this case HRB) and returns the number that follows it.

This is what I have so far. It doesn't catch all cases.

import re

def get_company_register_number(string, keyword):
  # Keyword as a whole word, optional separators, then capture the next word
  reg_1 = fr'\b{keyword}\b[,:|\s]*(\w+)'

  pattern = re.compile(reg_1)
  # findall returns the list of captured words, or [] when nothing matches
  return pattern.findall(string)


string = "HRB: 21156"
print(get_company_register_number(string, 'HRB'))
# Output: ['21156']
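
The question does not list the inputs that fail. As a rough illustration, the sketch below runs the same pattern against two variants borrowed from the test list in the second answer; the choice of samples is an assumption, not something the asker specified.

```python
import re

def get_company_register_number(string, keyword):
    # Same pattern as in the question: keyword, separators, then the next word
    return re.findall(fr'\b{keyword}\b[,:|\s]*(\w+)', string)

samples = ['HRB: 21156', 'HRB Nr. 21156', 'HRB-Nummer 21156']
question_results = {s: get_company_register_number(s, 'HRB') for s in samples}

for s, r in question_results.items():
    print(s, '=>', r)
```

With "HRB Nr. 21156" the capture group grabs Nr, because (\w+) takes whatever word comes next. With "HRB-Nummer 21156" nothing matches at all, because - is not in the separator class [,:|\s].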

Answer 1:


You could extend the character class and move the word boundary so that it sits directly before the digits.

\bHRB[.,: \w-]*\b(\d+)

See the updated regex
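
As a quick check of this looser pattern in Python, here is a minimal sketch; the sample strings and the name LOOSE are made up for illustration. Since the character class includes \w and is greedy, the capture group ends up on the last number in a run of allowed characters, which may or may not be what you want.

```python
import re

# The first pattern from this answer, unchanged
LOOSE = re.compile(r'\bHRB[.,: \w-]*\b(\d+)')

samples = ['HRB 21156', 'HRB-Nr.: 21156', 'HRB 123 456', 'no number here']
loose_results = {s: LOOSE.findall(s) for s in samples}

for s, r in loose_results.items():
    print(s, '=>', r)
```

Note that "HRB 123 456" yields ['456'], not ['123'].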

Or a bit more precise match:

\bHRB[,:]?(?:[- ](?:Nr|Nummer)[.:]*)? (\d+)
  • \bHRB - a word boundary, then HRB
  • [,:]? - an optional , or :
  • (?: - start of a non-capturing group
    • [- ](?:Nr|Nummer)[.:]* - a space or -, then Nr or Nummer, then zero or more . or :
  • )? - close the group and make it optional
  • (\d+) preceded by a literal space - match exactly one space, then capture one or more digits in group 1

Regex demo
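
A minimal Python sketch of the more precise pattern follows; the sample strings and the name PRECISE are assumptions chosen for illustration. The last sample has two spaces before the number, which this pattern deliberately does not accept.

```python
import re

# The second pattern from this answer, unchanged
PRECISE = re.compile(r'\bHRB[,:]?(?:[- ](?:Nr|Nummer)[.:]*)? (\d+)')

samples = ['HRB 21156', 'HRB: 99887', 'HRB-Nr. 12345', 'HRB Nr.: 21156', 'HRB  21156']
precise_results = {s: PRECISE.findall(s) for s in samples}

for s, r in precise_results.items():
    print(s, '=>', r)
```

The stricter pattern trades coverage for fewer false positives: only one optional Nr/Nummer label and exactly one space are allowed before the digits.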




Answer 2:


You may use

\bHRB\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)

See the regex demo

Details

  • \bHRB\b - a whole word HRB
  • (?:[-\s]N(?:umme)?r)? - an optional group matching - or whitespace and then Nr or Nummer
  • [,.:\s]* - 0 or more commas, dots, colons or whitespaces
  • (\d+) - Group 1: one or more digits.

See a Python demo:

import re

strings = ['HRB 21156','HRB, 1234','HRB: 99887','HRB-Nummer 21156','HRB-Nr. 12345','HRB-Nr: 21156','HRB Nr. 21156','HRB Nr: 21156','HRB Nr.: 21156','HRB Nummer 21156', 'no number here']

def get_company_register_number(string, keyword):
  return re.findall(fr'\b{keyword}\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)', string)

for s in strings:
  print(s, '=>', get_company_register_number(s, 'HRB'))

Output:

HRB 21156 => ['21156']
HRB, 1234 => ['1234']
HRB: 99887 => ['99887']
HRB-Nummer 21156 => ['21156']
HRB-Nr. 12345 => ['12345']
HRB-Nr: 21156 => ['21156']
HRB Nr. 21156 => ['21156']
HRB Nr: 21156 => ['21156']
HRB Nr.: 21156 => ['21156']
HRB Nummer 21156 => ['21156']
no number here => []
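
If a single number is wanted instead of a list, one possible variant is sketched below. The function name get_first_register_number is hypothetical, and the use of re.escape and re.search is a suggestion rather than part of the original answer. re.escape keeps a keyword that happens to contain regex metacharacters from changing the pattern.

```python
import re

def get_first_register_number(string, keyword):
    # Escape the keyword so characters like . or + are matched literally
    pattern = fr'\b{re.escape(keyword)}\b(?:[-\s]N(?:umme)?r)?[,.:\s]*(\d+)'
    match = re.search(pattern, string)
    # Return the captured digits, or None when no number follows the keyword
    return match.group(1) if match else None

print(get_first_register_number('HRB Nr.: 21156', 'HRB'))
print(get_first_register_number('Amtsgericht Hamburg, HRA 98765', 'HRA'))
print(get_first_register_number('no number here', 'HRB'))
```

The first two calls return the digits as a string and the last one returns None, so the caller can test the result directly instead of checking for an empty list.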


Source: https://stackoverflow.com/questions/63886756/extract-companies-register-number-in-python-by-getting-the-next-word
