python re can't find this grouped name

ⅰ亾dé卋堺 posted on 2019-12-11 05:25:40

Question


I am trying to give advice on the format of paper references. For example, for an academic dissertation the format is:

author. dissertation name[D]. place where it is stored: organization that holds the copy, year in which the dissertation was published.

Obviously, every item except the year may contain punctuation. For example:

Smith. The paper name. The subtitle of paper[D]. United States: MIT, 2011

Often the place where it is stored or the year is missing, for example:

Smith. The paper name. The subtitle of paper[D]. US, 2011
Smith. The paper name. The subtitle of paper[D]. US: MIT

I want to detect this with a program like this:

import re
reObj = re.compile(
r'.*\[D\]\.  \s*  ((?P<PLACE>[^:]*):){0,1} \s*   (?P<HOLDER>[^:]*)   (?P<YEAR>,\s*(1|2)\d{3}){0,1}',
re.VERBOSE
)

txt = '''Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
Smith. The paper name. The subtitle of paper[D]. US, 2011
Smith. The paper name. The subtitle of paper[D]. US: MIT'''.split('\n')

for i in txt:
    if reObj.search(i):
        if reObj.search(i).group('PLACE')==None:
            print('missing place')

        if reObj.search(i).group('YEAR')==None:
            print('missing year')
    else:
        print('bad formation')

But I found that YEAR is never captured. This loop:

for i in txt:
    print(i)
    print(reObj.search(i).group('HOLDER'))

outputs

Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
MIT, 2011
Smith. The paper name. The subtitle of paper[D]. US, 2011
US, 2011
Smith. The paper name. The subtitle of paper[D]. US: MIT
MIT

for i in txt:
    print(i)
    print(reObj.search(i).group('YEAR'))

outputs

Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
None
Smith. The paper name. The subtitle of paper[D]. US, 2011
None
Smith. The paper name. The subtitle of paper[D]. US: MIT
None

So why does my named group fail, and how can I fix it? Thanks.


Answer 1:


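In your pattern, HOLDER is [^:]*, which also matches commas and digits. Being greedy, it consumes ", 2011" up to the end of the line, so the optional YEAR group matches nothing and stays None. Excluding the comma from the character classes avoids that. It seems to me you may use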

reObj = re.compile(r"""
    \[D\]\. \s*                      # "[D]." followed by 0+ whitespace characters
    (?:                              # Optional group wrapping everything after "[D]."
      (?P<PLACE>[^,:]*)              # Group "PLACE": 0+ characters other than "," and ":"
      (?:                            # Optional sequence of
        : \s* (?P<HOLDER>[^,:]*)     #   a colon, 0+ whitespace, group "HOLDER" (0+ non-colon, non-comma characters)
      )?
      (?:                            # Optional sequence of
        , \s* (?P<YEAR>[12]\d{3})    #   a comma, 0+ whitespace, group "YEAR" (1 or 2, then three digits)
      )?
    )?
    $                                # End of string
    """, flags=re.X)

Python demo, with the same pattern on one line:

import re

reObj = re.compile(
    r"\[D\]\.\s*(?:(?P<PLACE>[^,:]*)(?::\s*(?P<HOLDER>[^,:]*))?(?:,\s*(?P<YEAR>[12]\d{3}))?)?$",
    re.VERBOSE
)
txt = '''Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
Smith. The paper name. The subtitle of paper[D]. US, 2011
Smith. The paper name. The subtitle of paper[D]. US: MIT'''.split('\n')

for i in txt:
    print('------------------------\nTESTING {}'.format(i))
    m = reObj.search(i)
    if m:
        # An empty or absent group counts as missing.
        if not m.group('PLACE'):
            print('missing place')
        else:
            print(m.group('PLACE'))

        # Kept inside "if m:" so a non-matching line cannot raise AttributeError.
        if not m.group('YEAR'):
            print('missing year')
        else:
            print(m.group('YEAR'))
    else:
        print('bad format')

Output:

------------------------
TESTING Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011
US
2011
------------------------
TESTING Smith. The paper name. The subtitle of paper[D]. US, 2011
US
2011
------------------------
TESTING Smith. The paper name. The subtitle of paper[D]. US: MIT
US
missing year
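
The effect of keeping commas out of HOLDER can be checked in isolation. The sketch below compares a simplified form of the question's pattern with a variant whose character classes also exclude the comma. The names `greedy` and `fixed` are introduced here for illustration only and are not part of the original answer.

```python
import re

line = "Smith. The paper name. The subtitle of paper[D]. US: MIT, 2011"

# Simplified form of the question's pattern: HOLDER accepts anything except
# a colon, so it is free to run to the end of the line, comma and year included.
greedy = re.compile(
    r"\[D\]\.\s*(?:(?P<PLACE>[^:]*):)?\s*(?P<HOLDER>[^:]*)(?P<YEAR>,\s*[12]\d{3})?"
)

# Same shape, but PLACE and HOLDER also stop at a comma, which leaves
# ", 2011" available for the optional YEAR part.
fixed = re.compile(
    r"\[D\]\.\s*(?:(?P<PLACE>[^,:]*):)?\s*(?P<HOLDER>[^,:]*)(?:,\s*(?P<YEAR>[12]\d{3}))?"
)

greedy_match = greedy.search(line)
fixed_match = fixed.search(line)

print(greedy_match.group("HOLDER"), greedy_match.group("YEAR"))
print(fixed_match.group("HOLDER"), fixed_match.group("YEAR"))
```

On this sample line the first pattern reports HOLDER as `MIT, 2011` with YEAR as `None`, while the second reports `MIT` and `2011`. The regex engine tries the optional YEAR group only after HOLDER has taken as much as it can, and an optional group that finds nothing to match is simply skipped rather than forcing HOLDER to give characters back.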


Source: https://stackoverflow.com/questions/50946163/python-re-cant-find-this-grouped-name
