Regex for name extraction on text file

Posted by 落花浮王杯 on 2021-02-17 02:37:06

Question


I've got a plain text file containing a list of authors and abstracts and I'm trying to extract just the author names to use for network analysis. My text follows this pattern and contains 500+ abstracts:

2010 - NUCLEAR FORENSICS OF SPECIAL NUCLEAR MATERIAL AT LOS ALAMOS: THREE RECENT STUDIES 

Purchase this article

David L. Gallimore, Los Alamos National Laboratory

Katherine Garduno, Los Alamos National Laboratory

Russell C. Keller, Los Alamos National Laboratory

Nuclear forensics of special nuclear materials is a highly specialized field because there are few analytical laboratories in the world that can safely handle nuclear materials, perform high accuracy and precision analysis using validated analytical methods.

I'm using Python 2.7.6 with the re library.

I've tried:

import re
regex = re.compile(r'( [A-Z][a-z]*,+)')
print regex.findall(text)

This pulls out only the last names, plus any capitalized word that precedes a comma in the abstracts.

Using (r'.*,') works perfectly to extract the full name, but it also grabs the abstract text, which I don't need.

Maybe regex is the wrong approach? Any help or ideas are appreciated.


Answer 1:


If you are trying to match the names, I would try to match the entire substring instead of part of it.

You could use the following regular expression and modify it if needed.

>>> regex = re.compile(r'\b([A-Z][a-z]+(?: [A-Z]\.)? [A-Z][a-z]+),')
>>> print regex.findall(text)
['David L. Gallimore', 'Katherine Garduno', 'Russell C. Keller']
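One possible refinement, sketched here as an assumption rather than as part of the original answer: if every author sits on its own line in the form "Name, Affiliation", the same pattern can be anchored to the start of a line with re.MULTILINE. That should keep it from picking up capitalized word pairs that happen to precede a comma inside an abstract. The variable names below are illustrative only, and the print call is written so that it behaves the same on Python 2.7 and Python 3.

```python
import re

# Sample text copied from the question (a single abstract).
text = """2010 - NUCLEAR FORENSICS OF SPECIAL NUCLEAR MATERIAL AT LOS ALAMOS: THREE RECENT STUDIES

Purchase this article

David L. Gallimore, Los Alamos National Laboratory

Katherine Garduno, Los Alamos National Laboratory

Russell C. Keller, Los Alamos National Laboratory

Nuclear forensics of special nuclear materials is a highly specialized field because there are few analytical laboratories in the world that can safely handle nuclear materials, perform high accuracy and precision analysis using validated analytical methods.
"""

# Same pattern as in the answer, but anchored with ^ and re.MULTILINE.
# Assumption: each author line starts with the name, followed by a comma.
name_at_line_start = re.compile(
    r'^([A-Z][a-z]+(?: [A-Z]\.)? [A-Z][a-z]+),',
    re.MULTILINE,
)

authors = name_at_line_start.findall(text)
print(authors)
```

Like the original pattern, this only covers names of the form "First Last" or "First M. Last". Hyphenated surnames, apostrophes, multiple initials, and non-ASCII letters would need a wider character class.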





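Answer 2:


Try this one:

[A-Za-z]* ?([A-Za-z]+.) [A-Za-z]*(?:,+)

It is meant to make the middle name optional and to keep the comma out of the result. Two caveats apply. The . is unescaped, so it matches any character rather than a literal period. The non-capturing group (?:,+) still consumes the comma as part of the overall match; with re.findall, only the text held by the capturing group is returned.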
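A minimal sketch of how the same idea could be written so that the comma is checked but not included in the match. This is a hypothetical rewrite, not the answerer's exact regex: it escapes the period, makes the middle initial an optional non-capturing group, and uses a lookahead (?=,) for the comma. The sample string is taken from two author lines in the question.

```python
import re

# Hypothetical rewrite of the pattern above (an assumption, not the
# answerer's regex): optional middle initial, literal period, and a
# lookahead so the trailing comma is required but not consumed.
pattern = re.compile(r'[A-Za-z]+(?: [A-Za-z]+\.)? [A-Za-z]+(?=,)')

sample = (
    "David L. Gallimore, Los Alamos National Laboratory\n"
    "Katherine Garduno, Los Alamos National Laboratory"
)

# With no capturing groups, findall returns the whole match each time.
names = pattern.findall(sample)
print(names)
```

Because the character classes accept lowercase letters too, this version would also match any two ordinary words that precede a comma in an abstract, so it is probably best applied only to the author lines.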



Source: https://stackoverflow.com/questions/26188295/regex-for-name-extraction-on-text-file
