Question
I am trying to create a regular expression that searches a huge document for a person's full name. In the text the name can be written in full, or the first names can be abbreviated to a single letter, abbreviated to a letter followed by a dot, or omitted. For instance, my current search for ALBERTO JORGE ALONSO CALEFACCION is:
preg_match('/([;:.,&\s\xc2\-(){}!"\'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)([;:.,&\s\xc2(){}!"\'<>]{1})/i', $text, $match);
Between the first names and last names an asterisk (*) can be present.
This works as long as every first name is present in some form. But I don't know how to extend the expression to cover the case where first names are omitted. Can you help me?
Answer 1:
Let's start by simplifying what you have. Starting point:
/([;:.,&\s\xc2\-(){}!"'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)([;:.,&\s\xc2(){}!"'<>]{1})/i
As I said in my comment, \b is a word boundary, so you can simplify a lot of that:
/\b(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
(Added bonus: it no longer consumes the characters on either side of the name, and it also matches at the very start and end of the text.)
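To make the bonus concrete, here is a minimal sketch that runs both patterns against a text consisting of nothing but the name. The sample text is made up for illustration, and the single quotes inside the original pattern are escaped so that it fits in a PHP single-quoted string.

```php
<?php
// Original pattern: needs one delimiter character before and after the name.
$original = '/([;:.,&\s\xc2\-(){}!"\'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)([;:.,&\s\xc2(){}!"\'<>]{1})/i';
// Simplified pattern: \b asserts a word boundary without consuming anything.
$simplified = '/\b(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i';

// Made-up sample: the name starts and ends the text, so there is no
// delimiter character on either side for the original pattern to consume.
$text = 'ALBERTO JORGE ALONSO CALEFACCION';

$originalHits   = preg_match($original, $text);
$simplifiedHits = preg_match($simplified, $text, $match);

// $match[0] contains only the name, with no neighbouring characters.
var_dump($originalHits, $simplifiedHits, $match[0] ?? null);
```

The original pattern should find nothing here, while the \b version should return the name itself in $match[0].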
Next, you can use the ? token for the dots (which should be escaped, by the way; an unescaped . is special and means "match any single character"):
/\b(ALBERTO|A\.?)[\s\xc2-]+(JORGE|J\.?)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
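A small sketch of why the escaping matters, using a made-up input in which the first name is the two-letter word AX. With the unescaped dot, A. happily accepts AX as if it were an initial; with A\.? it does not.

```php
<?php
// Before: "A." means "A followed by any one character".
$unescaped = '/\b(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i';
// After: "A\.?" means "A, optionally followed by a literal dot".
$escaped   = '/\b(ALBERTO|A\.?)[\s\xc2-]+(JORGE|J\.?)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i';

// Made-up text: "AX" is not an abbreviation of ALBERTO.
$falsePositive = 'AX JORGE ALONSO CALEFACCION';
// A genuine abbreviated form that both patterns should accept.
$genuine       = 'A. J. ALONSO CALEFACCION';

$results = [
    'unescaped, AX'   => preg_match($unescaped, $falsePositive),
    'escaped, AX'     => preg_match($escaped, $falsePositive),
    'unescaped, A.J.' => preg_match($unescaped, $genuine),
    'escaped, A.J.'   => preg_match($escaped, $genuine),
];
print_r($results);
```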
Finally, to actually answer your question, you have two choices: either make the entire bracketed name optional, or add an empty alternative to it. The first is the more flexible, since we need to cope with the whitespace too:
/\b((ALBERTO|A\.?)[\s\xc2-]+((JORGE|J\.?)[\s\xc2,]+)?)?(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
Note that the extra brackets shift the group numbers, so if you're reading the matched parts from $match you'll need to update your indices. Also note that this fixes an issue where omitting the second name (JORGE) still required an extra space.
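If renumbering indices is a nuisance, one option (my suggestion here, not part of the pattern above) is to name the groups you care about and make the purely structural brackets non-capturing. The group names first, second, last1 and last2 and the sample sentence are made up for this sketch.

```php
<?php
// Same structure as the pattern above, but with named groups (?<name>...)
// for the name parts and non-capturing groups (?:...) for the optional wrappers.
$pattern = '/\b(?:(?<first>ALBERTO|A\.?)[\s\xc2-]+(?:(?<second>JORGE|J\.?)[\s\xc2,]+)?)?'
         . '(?<last1>ALONSO)[\s\xc2*-]+(?<last2>CALEFACCION)\b/i';

// Made-up sample text with the second first name omitted.
$found = preg_match($pattern, 'Signed by A. ALONSO CALEFACCION.', $m);

// A group that did not take part in the match, but is followed by one
// that did, comes back as an empty string.
printf(
    "first=[%s] second=[%s] last=[%s %s]\n",
    $m['first'],
    $m['second'],
    $m['last1'],
    $m['last2']
);
```

The named keys stay the same however many brackets are added around them later.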
This will match things like A. J. ALONSO CALEFACCION, A. ALONSO CALEFACCION and ALONSO CALEFACCION, but not J. ALONSO CALEFACCION as a whole (a search still finds ALONSO CALEFACCION inside it, just without the J.; it's only a small tweak if you do want that).
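A quick way to check these cases is to run the pattern over each spelling and look at what was actually matched. This is only a sketch with the sample strings from the text; note that preg_match searches within the string, so the last sample still yields a match, just a shorter one.

```php
<?php
$pattern = '/\b((ALBERTO|A\.?)[\s\xc2-]+((JORGE|J\.?)[\s\xc2,]+)?)?(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i';

$samples = [
    'ALBERTO JORGE ALONSO CALEFACCION',
    'A. J. ALONSO CALEFACCION',
    'A. ALONSO CALEFACCION',
    'ALONSO CALEFACCION',
    'J. ALONSO CALEFACCION',
];

$matched = [];
foreach ($samples as $sample) {
    // Store the matched substring, or null when the pattern found nothing.
    $matched[$sample] = preg_match($pattern, $sample, $m) === 1 ? $m[0] : null;
}

print_r($matched);
```

For the first four samples the matched substring should be the whole sample; for the last one it should be ALONSO CALEFACCION only, because a lone J. is not accepted as a first name.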
Breaking up that final string for clarity:
/\b
  (
    (ALBERTO|A\.?)[\s\xc2-]+
    (
      (JORGE|J\.?)[\s\xc2,]+
    )?
  )?
  (ALONSO)[\s\xc2*-]+
  (CALEFACCION)
\b/i
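If you'd like to keep that layout in real code, PCRE's x modifier ignores whitespace and newlines in the pattern outside character classes. The following is a sketch of that idea rather than something the pattern above requires; it works here because none of the character classes contain a literal space and the pattern contains no # (which would start a comment in x mode).

```php
<?php
// The i modifier makes the match case-insensitive; the x modifier lets the
// pattern be spread over several lines with indentation.
$pattern = '/\b
    (
        (ALBERTO|A\.?)[\s\xc2-]+
        (
            (JORGE|J\.?)[\s\xc2,]+
        )?
    )?
    (ALONSO)[\s\xc2*-]+
    (CALEFACCION)
\b/ix';

$found = preg_match($pattern, 'A. J. ALONSO CALEFACCION', $m);

var_dump($found, $m[0] ?? null);
```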
One last, slightly odd thought: you could write the names that can be reduced to initials in this form: (A(LBERTO|\.|)), which means you're not repeating the initial (a potential source of mistakes).
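Taking that one step further, the fragment can be generated from the name so the initial is never typed by hand. The helper nameOrInitial below is hypothetical (it is not from the pattern above), it uses a non-capturing inner group, and it assumes plain ASCII names, since substr works on bytes.

```php
<?php
// Hypothetical helper: turns "ALBERTO" into "A(?:LBERTO|\.|)", i.e. the
// initial followed by the rest of the name, a literal dot, or nothing.
function nameOrInitial(string $name): string
{
    $initial = preg_quote(substr($name, 0, 1), '/');
    $rest    = preg_quote(substr($name, 1), '/');
    return $initial . '(?:' . $rest . '|\.|)';
}

$first  = nameOrInitial('ALBERTO');
$second = nameOrInitial('JORGE');

// Same optional-first-names structure as before, built from the fragments.
$pattern = '/\b(?:(' . $first . ')[\s\xc2-]+(?:(' . $second . ')[\s\xc2,]+)?)?'
         . '(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i';

$found = preg_match($pattern, 'ALBERTO J. ALONSO CALEFACCION', $m);

var_dump($first, $found, $m[1] ?? null, $m[2] ?? null);
```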
Source: https://stackoverflow.com/questions/18773711/how-check-different-spellings-of-a-persons-full-name