How to replace all unicode characters except for Spanish ones?

Submitted by 十年热恋 on 2020-06-23 07:38:27

Question


I am trying to remove all Unicode characters from a file except for the Spanish characters.

Matching the accented vowels has not been an issue: áéíóúÁÉÍÓÚ are kept by the following regex, while all other non-ASCII characters appear to be removed:

perl -pe 's/[^áéíóúÁÉÍÓÚ[:ascii:]]//g;' filename

But when I add the inverted question mark ¿ or the inverted exclamation mark ¡ to the regex, other Unicode characters that I want removed are also matched and kept:

perl -pe 's/[^áéíóúÁÉÍÓÚ¡¿[:ascii:]]//g;' filename

This command does not remove the following characters (some are not printable): ³ � � ­

Am I missing something obvious here? I am also open to other ways of doing this on the terminal.


Answer 1:


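Your file is UTF-8 encoded and contains Unicode characters, so you need to pass a specific set of options to tell Perl about that. Without them, Perl treats both the pattern and the input as raw bytes. ¡ and ¿ are encoded as the bytes C2 A1 and C2 BF, so adding them puts the byte C2 into the character class, and two-byte characters whose bytes are all in the class, such as ³ (C2 B3), are no longer removed.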

Add -Mutf8 so that Perl treats the UTF-8 encoded characters written directly in your Perl code as characters rather than as separate bytes.

You also need to pass -CSD (equivalent to -CIOEio) so that your input is decoded and your output is re-encoded. These flags assume UTF-8, so they only work when the file is UTF-8 encoded.

perl -CSD -Mutf8 -pe 's/[^áéíóúñüÁÉÍÓÚÑÜ¡¿[:ascii:]]//g;' filename

Do not forget Ü and ü, which Spanish also uses. The command above adds ñ and Ñ as well.



Source: https://stackoverflow.com/questions/54507064/how-to-replace-all-unicode-characters-except-for-spanish-ones
