Name Extraction - CV/Resume - Stanford NER/OpenNLP

冷眼眸甩不掉的悲伤 submitted on 2019-12-01 14:51:53

My two cents on the problem.

Running the NER taggers you listed above would be my first block in the pipeline. If they identify the name, voilà, there is no need to go further. If not, I suggest falling back to a rule-based approach. In a resume, the candidate's name is generally within the top 10% of the lines. In many cases it also follows an explicit label, as in "Name : Ankit Solanki". If that fails, try to find the email address and match it against the noun phrases (NPs) you extract from the rest of the resume; the closest match should be the name. In the majority of cases, the email address people use for professional purposes such as a resume contains their name. For example, john.mayer89@abc.com gets cleaned to john.mayer, which in turn goes through an algorithm that finds the noun phrase closest to the cleaned email name.
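The two rule-based fallbacks could be sketched roughly as below. This is only an illustration, not a tested resume parser: the function names, the regular expressions, and the use of `difflib.SequenceMatcher` as the "closest match" measure are my assumptions, and the candidate noun phrases are assumed to come from whatever chunker you already run.

```python
import re
from difflib import SequenceMatcher

# Assumed patterns: a "Name : ..." label line and a simple email matcher.
LABEL_RE = re.compile(r"^\s*name\s*:\s*(.+?)\s*$", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def name_from_label(lines):
    """Look for a 'Name : ...' line in the top 10% of the resume."""
    top = lines[: max(1, len(lines) // 10)]  # at least the first line
    for line in top:
        match = LABEL_RE.match(line)
        if match:
            return match.group(1)
    return None


def clean_email_local_part(email):
    """Drop the domain and digits: john.mayer89@abc.com -> john.mayer."""
    local = email.split("@", 1)[0]
    return re.sub(r"\d+", "", local).strip("._-+").lower()


def _letters_only(text):
    """Lowercase and keep only a-z so 'John Mayer' compares with 'john.mayer'."""
    return re.sub(r"[^a-z]", "", text.lower())


def name_from_email(text, candidates):
    """Return the candidate noun phrase most similar to the cleaned email name."""
    match = EMAIL_RE.search(text)
    if not match or not candidates:
        return None
    key = _letters_only(clean_email_local_part(match.group(0)))

    def similarity(candidate):
        return SequenceMatcher(None, key, _letters_only(candidate)).ratio()

    return max(candidates, key=similarity)
```

A caller would try `name_from_label` first and only fall through to `name_from_email` when it returns `None`. In practice you would probably also want a minimum similarity threshold, since `max` always returns some candidate even when nothing resembles the email.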

Let me know your thoughts on this.

Best,

Ankit

I guess you'll probably improve name identification if you create a CV corpus, although how much also depends on the size of that corpus (you could gather one by crawling CV websites).

Using data mining is probably, in my opinion, your best option. I don't know in detail what options Apache Tika offers, but the more information you have on the layout of the CV, the better. For instance, patterns should probably rely on the fact that names appear at the top of the document, close to the birth date, marital status, photo, or address.

In that case, you are no longer in a sequence labelling setting (which is what Stanford NER does): in a CV, a name is usually not surrounded by running text. It should most probably be treated as a classification task over candidate segments of text, with the patterns converted into (numeric or binary) attributes.

Pattern extractors are easy to find or implement and should be treated as a preprocessing step before machine learning. Don't forget to also use lists of first and last names (and frequent prefixes and suffixes: -son, -vitch, -man, Ben-, de, etc.), which are unavoidable criteria for deciding which segment is likely to be a name. Other people's names often appear in a CV too, which is why I believe layout should also be an important feature.
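To make the "patterns as attributes" idea concrete, here is a minimal sketch of a feature extractor for one candidate line. Everything in it is an assumption for illustration: the tiny name lists stand in for real gazetteers, the cue words stand in for proper layout information, and the feature names are made up. The resulting dictionaries could then be fed to whatever classifier you prefer.

```python
import re

# Hypothetical, tiny stand-ins for real first/last name lists.
FIRST_NAMES = {"john", "ankit", "maria"}
LAST_NAMES = {"mayer", "solanki", "garcia"}
NAME_SUFFIXES = ("son", "vitch", "man")
NAME_PREFIXES = ("ben-",)
NAME_PARTICLES = {"de"}
# Assumed textual cues for fields that tend to sit near the candidate's name.
NEARBY_CUES = re.compile(r"date of birth|born|marital status|address", re.IGNORECASE)


def segment_features(lines, index):
    """Turn the line at `index` into numeric/binary attributes for a classifier."""
    line = lines[index].strip()
    tokens = line.split()
    lowered = [t.lower().strip(".,") for t in tokens]
    # Up to two lines above and below, excluding the candidate line itself.
    neighbours = lines[max(0, index - 2):index] + lines[index + 1:index + 3]
    return {
        # 0.0 is the top of the document, 1.0 the bottom.
        "relative_position": index / max(1, len(lines) - 1),
        "n_tokens": len(tokens),
        "all_capitalized": int(bool(tokens) and all(t[0].isupper() for t in tokens)),
        "has_first_name": int(any(t in FIRST_NAMES for t in lowered)),
        "has_last_name": int(any(t in LAST_NAMES for t in lowered)),
        "has_name_affix": int(any(
            t.endswith(NAME_SUFFIXES) or t.startswith(NAME_PREFIXES) or t in NAME_PARTICLES
            for t in lowered
        )),
        "has_digit": int(any(c.isdigit() for c in line)),
        "near_personal_cue": int(any(NEARBY_CUES.search(n) for n in neighbours)),
    }
```

The position and neighbour features are what separate the candidate's own name from other names further down the document (referees, managers), which the gazetteer features alone cannot do.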

I'd be interested to know which features turn out to be effective. Would you let us know?
