Regex for Matching Pinyin

Submitted by 强颜欢笑 on 2019-12-01 04:17:32

I went for a regex that grouped smaller regexes by the pinyin's initial (usually the first letter). So, the first group includes all "b", "p" and "m" sounds, then "f", then "d" and "t", etc.

This approach seems easy to read and should be easy to edit if it needs corrections or additions. I also put the exceptions at the beginning of each group to improve readability.

([mM]iu|[pmPM]ou|[bpmBPM](o|e(i|ng?)?|a(ng?|i|o)?|i(e|ng?|a[no])?|u))|([fF](ou?|[ae](ng?|i)?|u))|([dD](e(i|ng?)|i(a[on]?|u))|[dtDT](a(i|ng?|o)?|e(i|ng)?|i(a[on]?|e|ng|u)?|o(ng?|u)|u(o|i|an?|n)?))|([nN]eng?|[lnLN](a(i|ng?|o)?|e(i|ng)?|i(ang|a[on]?|e|ng?|u)?|o(ng?|u)|u(o|i|an?|n)?|ve?))|([ghkGHK](a(i|ng?|o)?|e(i|ng?)?|o(u|ng)|u(a(i|ng?)?|i|n|o)?))|([zZ]h?ei|[czCZ]h?(e(ng?)?|o(ng?|u)?|ao|u?a(i|ng?)?|u?(o|i|n)?))|([sS]ong|[sS]hua(i|ng?)?|[sS]hei|[sS][h]?(a(i|ng?|o)?|en?g?|ou|u(a?n|o|i)?|i))|([rR]([ae]ng?|i|e|ao|ou|ong|u[oin]|ua?n?))|([jqxJQX](i(a(o|ng?)?|[eu]|ong|ng?)?|u(e|a?n)?))|(([aA](i|o|ng?)?|[oO]u?|[eE](i|ng?|r)?))|([wW](a(i|ng?)?|o|e(i|ng?)?|u))|[yY](a(o|ng?)?|e|in?g?|o(u|ng)?|u(e|a?n)?)

Here's the Debuggex example I created.
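
For anyone who wants to try the pattern outside Debuggex, here is a minimal sketch in Python. The choice of Python, the re.VERBOSE layout (one group per line, whitespace ignored) and the helper name is_pinyin_syllable are my assumptions, not part of the original answer. The pattern is not anchored, so the sketch uses re.fullmatch to test one whole syllable at a time.

```python
import re

# The same pattern, laid out one group per line. re.VERBOSE tells the regex
# engine to ignore the whitespace and line breaks that are used for layout.
PINYIN_SYLLABLE = re.compile(r"""
    ([mM]iu|[pmPM]ou|[bpmBPM](o|e(i|ng?)?|a(ng?|i|o)?|i(e|ng?|a[no])?|u))|
    ([fF](ou?|[ae](ng?|i)?|u))|
    ([dD](e(i|ng?)|i(a[on]?|u))|
      [dtDT](a(i|ng?|o)?|e(i|ng)?|i(a[on]?|e|ng|u)?|o(ng?|u)|u(o|i|an?|n)?))|
    ([nN]eng?|[lnLN](a(i|ng?|o)?|e(i|ng)?|i(ang|a[on]?|e|ng?|u)?|o(ng?|u)|u(o|i|an?|n)?|ve?))|
    ([ghkGHK](a(i|ng?|o)?|e(i|ng?)?|o(u|ng)|u(a(i|ng?)?|i|n|o)?))|
    ([zZ]h?ei|[czCZ]h?(e(ng?)?|o(ng?|u)?|ao|u?a(i|ng?)?|u?(o|i|n)?))|
    ([sS]ong|[sS]hua(i|ng?)?|[sS]hei|[sS][h]?(a(i|ng?|o)?|en?g?|ou|u(a?n|o|i)?|i))|
    ([rR]([ae]ng?|i|e|ao|ou|ong|u[oin]|ua?n?))|
    ([jqxJQX](i(a(o|ng?)?|[eu]|ong|ng?)?|u(e|a?n)?))|
    (([aA](i|o|ng?)?|[oO]u?|[eE](i|ng?|r)?))|
    ([wW](a(i|ng?)?|o|e(i|ng?)?|u))|
    [yY](a(o|ng?)?|e|in?g?|o(u|ng)?|u(e|a?n)?)
""", re.VERBOSE)


def is_pinyin_syllable(text):
    """Return True if the whole string matches the pattern as one syllable."""
    return PINYIN_SYLLABLE.fullmatch(text) is not None


for candidate in ["ni", "hao", "zhong", "lv", "xyz"]:
    print(candidate, is_pinyin_syllable(candidate))
```

Using fullmatch rather than search matters here: with search, a string such as "xyz" could still report a hit on a substring.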

I would use a combination approach that is not solely regex.

Check for valid pinyin:

  1. grab word

  2. grab letters from the beginning of the word as long as they are consonants. This separates the initial sound from the final sound.

  3. check that the initial and final are valid...

  4. ...and if so, see if their combination is allowed (via a table like this, but the entries are simply 1's and 0's).
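
The four steps above could be sketched roughly like this in Python. Everything named here (ALLOWED, FINALS, split_syllable, is_valid_pinyin) is hypothetical, and the table covers only three initials and six finals for illustration; a real table would need every initial and final. Treating "v" as a vowel (standing in for ü) is also an assumption.

```python
import re

# Illustrative subset only: a real table needs all initials and finals.
FINALS = ["a", "i", "u", "ang", "ia", "uan"]

# ALLOWED[initial][final] is 1 when the combination is a real syllable, else 0.
ALLOWED = {
    "b":  {"a": 1, "i": 1, "u": 1, "ang": 1, "ia": 0, "uan": 0},
    "j":  {"a": 0, "i": 1, "u": 1, "ang": 0, "ia": 1, "uan": 1},
    "zh": {"a": 1, "i": 1, "u": 1, "ang": 1, "ia": 0, "uan": 1},
}


def split_syllable(word):
    """Step 2: take the leading consonants as the initial, the rest as the final."""
    match = re.fullmatch(r"([^aeiouv]*)(.*)", word.lower())
    return match.group(1), match.group(2)


def is_valid_pinyin(word):
    initial, final = split_syllable(word)
    # Step 3: both parts must be known on their own.
    if initial not in ALLOWED or final not in FINALS:
        return False
    # Step 4: look the combination up in the 1/0 table.
    return ALLOWED[initial][final] == 1


for candidate in ["zhang", "juan", "bia", "ja"]:
    print(candidate, is_valid_pinyin(candidate))
```

Keeping the validity data in a table rather than in the regex means a correction is a one-cell change, at the cost of having to fill in the full table first.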
