Extract business titles and time periods from string

Question

I am extracting information about certain companies from Reuters using Python. I have been able to get the officer/executive names, biographies, and compensation from this page.

Now, I want to extract previous position titles and companies from the biography section, which looks something like this:

Mr. Donald T. Grimes is Senior Vice President, Chief Financial Officer and Treasurer of Wolverine World Wide, Inc., since May 2008. From 2007 to 2008, he was the Executive Vice President and Chief Financial Officer for Keystone Automotive Operations, Inc., a distributor of automotive accessories and equipment. Prior to Keystone, Mr. Grimes held a series of senior corporate and divisional finance roles at Brown-Forman Corporation, a manufacturer and marketer of premium wines and spirits. During his employment at Brown-Forman, Mr. Grimes was Vice President, Director of Beverage Finance from 2006 to 2007; Vice President, Director of Corporate Planning and Analysis from 2003 to 2006; and Senior Vice President, Chief Financial Officer of Brown-Forman Spirits America from 1999 to 2003.

I can use a simple regex to get the "from" and "to" years, but I am at a loss on how to write a regex that also captures the titles and the company name. I know the string format is inconsistent, so I would accept an answer that works for at least 70% of cases. Here's the output I would like:

2007-2008, executive vice president and chief financial officer, Keystone Automotive operations

Answer 1:

The problem you are trying to solve is well known and well researched. You will find a large number of research papers describing approaches and algorithms if you google the terms "Named Entity Extraction" and "Relationship Extraction". Some good starting points are:

Chapter 7 of the book "Natural Language Processing with Python"; in fact, the entire book would probably be helpful. The chapter is available online here
This paper on "Named Entity Relation Mining using Wikipedia"
This paper, "Novel Algorithms for Relationship Mining", which describes mining job titles and organizations as one of its examples.

These are just a few links I've found interesting. There are a ton more, and probably better ones than these, but this should get you started.

Answer 2:

I don't think there is going to be a single regex that you can use for this, unless it's really nasty. I think the solution to this might be Natural Language Processing. Certainly there are packages for this, but using them might not be simple.

Essentially you want to take a sentence like "X is/was Y", and figure out which part is a name, which part is a list of job titles, and which parts are irrelevant. Maybe look for sequences of words that are either capitalized or small words like "and" and "of"?

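[A-Z]\w+(?: (?:[A-Z]\w*|of|and))*  # note the leading space inside the group

Here [A-Z] requires the first character of each word to be uppercase. Python's re module has no \u shorthand for "uppercase letter", so an explicit character class is used instead. I haven't tested it, but it seems like it should work. Either way, this may be a non-trivial problem.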
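As a rough illustration of the capitalized-sequence idea in Python, here is a minimal sketch using only the standard re module. The names CAP, ROLE, capitalized_phrases and extract_roles are made up for this example, and the ROLE pattern assumes one specific sentence shape ("From YYYY to YYYY, he was the TITLE for COMPANY"), so it will miss biographies phrased differently. The connectors "of" and "and" are only accepted when another capitalized word follows, which is a small variation on the pattern above.

```python
import re

# A capitalized word, optionally followed by more capitalized words.
# "of"/"and" are allowed only when a capitalized word comes right after them.
CAP = r"[A-Z]\w+(?: (?:(?:of|and) )?[A-Z]\w*)*"

# Assumed sentence shape: "From 2007 to 2008, he was the TITLE for COMPANY".
ROLE = re.compile(
    rf"[Ff]rom (\d{{4}}) to (\d{{4}}), \w+ was (?:the )?({CAP}) (?:for|of|at) ({CAP})"
)


def capitalized_phrases(text):
    """Return every run of capitalized words (candidate titles and names)."""
    return re.findall(CAP, text)


def extract_roles(text):
    """Return (years, title, company) tuples for sentences matching ROLE."""
    return [
        (f"{m.group(1)}-{m.group(2)}", m.group(3), m.group(4))
        for m in ROLE.finditer(text)
    ]


sentence = (
    "From 2007 to 2008, he was the Executive Vice President and Chief "
    "Financial Officer for Keystone Automotive Operations, Inc., a "
    "distributor of automotive accessories and equipment."
)

for years, title, company in extract_roles(sentence):
    print(f"{years}, {title.lower()}, {company}")
```

Note that capitalized_phrases also picks up noise such as the sentence-initial "From" and the fragment "Inc", so its output would need filtering. Other phrasings in the biography (for example "TITLE from 2006 to 2007") would each need their own pattern, which is why a single regex is unlikely to cover everything.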

Source: https://stackoverflow.com/questions/7757554/extract-business-titles-and-time-periods-from-string

Tags

python

regex

nlp