How to check different spellings of a person's full name

Posted by 旧巷老猫 on 2020-01-16 00:41:13

Question


I am trying to create a regular expression that searches a huge document for a person's full name. In the text, the name can be written in full, or each first name can be abbreviated to a single letter, abbreviated to a letter followed by a dot, or omitted. For instance, my search for _ALBERTO JORGE ALONSO CALEFACCION_ is currently:

preg_match('/([;:.,&\s\xc2\-(){}!"\'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)([;:.,&\s\xc2(){}!"\'<>]{1})/i', $text, $match);

Between the first names and last names an asterisk (*) can be present.

This works as long as every first name is present in some form. But I don't know how to extend the expression for the case where first names are omitted. Can you help me?


Answer 1:


Let's start by simplifying what you have.

Start:

/([;:.,&\s\xc2\-(){}!"'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)([;:.,&\s\xc2(){}!"'<>]{1})/i

As I said in my comment, \b matches a word boundary, so you can simplify a lot of that:

/\b(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i

(Added bonus: it no longer captures the characters on either side, and it will match at the start and end of the text.)
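To see the effect of swapping the delimiter classes for \b, here is a minimal PHP sketch. The two patterns are the ones shown above; the sample text and variable names are made up for illustration.

```php
<?php
// Hypothetical sample text: the name is the entire string,
// so there is no character before or after it.
$text = 'A. J. ALONSO CALEFACCION';

// Original pattern: requires one delimiter character on each side of the name.
$withDelimiters = '/([;:.,&\s\xc2\-(){}!"\'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)([;:.,&\s\xc2(){}!"\'<>]{1})/i';

// Simplified pattern: \b is zero-width, so nothing around the name is consumed.
$withBoundaries = '/\b(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i';

// No match: there is no delimiter character before the first "A".
$delimiterHits = preg_match($withDelimiters, $text);

// Match: the start and end of the text count as word boundaries.
$boundaryHits = preg_match($withBoundaries, $text);

var_dump($delimiterHits, $boundaryHits);
```

Note that \b is not an exact replacement for the delimiter classes: it fires wherever a word character meets a non-word character, which is a slightly wider set of positions than the original lists allowed.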

Next, you can use the ? quantifier to make the dots optional (the dots should be escaped, by the way; an unescaped . is special and means "match any character"):

/\b(ALBERTO|A\.?)[\s\xc2-]+(JORGE|J\.?)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
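The difference the escaping makes can be checked with a short sketch like the one below. The input string is a made-up example in which "AX" stands in for something that should not count as a first name.

```php
<?php
// Hypothetical input: "AX" is not a valid spelling of the first name.
$text = 'AX J. ALONSO CALEFACCION';

// Unescaped dots: "A." means "A followed by any character", so "AX" is accepted.
$unescaped = '/\b(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i';

// Escaped, optional dots: only "A" or "A." are accepted as the initial.
$escaped = '/\b(ALBERTO|A\.?)[\s\xc2-]+(JORGE|J\.?)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i';

$unescapedHits = preg_match($unescaped, $text); // matches, "AX" slips through
$escapedHits   = preg_match($escaped, $text);   // no match

var_dump($unescapedHits, $escapedHits);
```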

Finally, to actually answer your question, you have two choices: either make the entire bracketed name optional, or add a new blank option. The first is the more flexible, since we need to cope with the whitespace too:

/\b((ALBERTO|A\.?)[\s\xc2-]+((JORGE|J\.?)[\s\xc2,]+)?)?(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i

Note that if you're reading the matched parts you'll need to update your indices. Also note that this fixed an issue where omitting the second name (JORGE) still required an extra space.

This will match things like A. J. ALONSO CALEFACCION, A. ALONSO CALEFACCION and ALONSO CALEFACCION. For J. ALONSO CALEFACCION it will only match the ALONSO CALEFACCION part, without the J. (it's only a small tweak if you do want the J. included).
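As a rough check of the pattern and of the new group indices, the sketch below runs it over a few invented spellings. The sample list and the array keys are assumptions made for this example; only the pattern itself comes from the answer.

```php
<?php
// Final pattern from above, unchanged.
$pattern = '/\b((ALBERTO|A\.?)[\s\xc2-]+((JORGE|J\.?)[\s\xc2,]+)?)?(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i';

// Hypothetical sample spellings.
$samples = [
    'ALBERTO JORGE ALONSO CALEFACCION',
    'A. J. ALONSO CALEFACCION',
    'A. ALONSO CALEFACCION',
    'ALONSO * CALEFACCION',
    'J. ALONSO CALEFACCION',
];

$results = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        // Group 2 is the first name, group 4 the second name,
        // groups 5 and 6 the surnames. Groups 1 and 3 are the
        // wrappers that also contain the trailing separators.
        $results[$sample] = [
            'matched' => $m[0],
            'first'   => $m[2], // empty string when the name is omitted
            'second'  => $m[4], // empty string when the name is omitted
        ];
    }
}

print_r($results);
```

All five samples produce a match, but for the last one the 'matched' value is just the two surnames, because the pattern has no way to consume a lone second initial.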

Breaking up that final string for clarity:

/\b
(
    (ALBERTO|A\.?)[\s\xc2-]+
    (
        (JORGE|J\.?)[\s\xc2,]+
    )?
)?
(ALONSO)[\s\xc2*-]+
(CALEFACCION)
\b/i
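The broken-up layout can also be used directly in code. PCRE's x modifier ignores unescaped whitespace outside character classes and treats # as the start of a comment, so a sketch along these lines should behave like the one-line version. The comments and the sample strings are additions for illustration.

```php
<?php
// Same pattern, written with the x modifier so the layout can stay as-is.
// Comments inside the pattern must not contain the delimiter or a quote.
$pattern = '/\b
(                                   # optional block of first names
    (ALBERTO|A\.?)[\s\xc2-]+        # first name or initial, then separator
    (
        (JORGE|J\.?)[\s\xc2,]+      # second name or initial, then separator
    )?
)?
(ALONSO)[\s\xc2*-]+                 # first surname, then separator
(CALEFACCION)
\b/ix';

$hits = preg_match($pattern, 'A. J. ALONSO CALEFACCION');
var_dump($hits);
```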

One last, slightly odd thought: you could write the names that can be initials in this form: (A(LBERTO|\.|)). That way you're not repeating the initials (a potential source of mistakes).
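If the pattern is built for many different names, that form can be generated rather than typed. Below is a minimal sketch, assuming single-byte (ASCII) names of at least two letters; nameOrInitial is a hypothetical helper name, and it uses non-capturing groups (?:...) instead of the capturing ones shown above so that it does not add more indices.

```php
<?php
// Hypothetical helper: turns "ALBERTO" into "A(?:LBERTO|\.|)",
// i.e. the initial followed by the rest of the name, a dot, or nothing.
// Assumes a single-byte (ASCII) name of at least two letters.
function nameOrInitial(string $name): string
{
    $initial = preg_quote(substr($name, 0, 1), '/');
    $rest    = preg_quote(substr($name, 1), '/');
    return $initial . '(?:' . $rest . '|\.|)';
}

// Same structure as the final pattern, with the first names generated.
$pattern = '/\b(?:' . nameOrInitial('ALBERTO') . '[\s\xc2-]+'
         . '(?:' . nameOrInitial('JORGE') . '[\s\xc2,]+)?)?'
         . 'ALONSO[\s\xc2*-]+CALEFACCION\b/i';

$hits = preg_match($pattern, 'A. J. ALONSO CALEFACCION', $m);
var_dump($hits, $m[0]);
```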



Source: https://stackoverflow.com/questions/18773711/how-check-different-spellings-of-a-persons-full-name
