Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars

后端未结

关注

 12  841

故里飘歌 2020-11-22 11:42

I am looking at an algorithm that can map between characters with diacritics (tilde, circumflex, caret, umlaut, caron) and their \"simple\" character.

For example:

12条回答

耶瑟儿～ (楼主)

2020-11-22 12:01
The easiest way (to me) would be to simply maintain a sparse mapping array which simply changes your Unicode code points into displayable strings.

Such as:
```
start    = 0x00C0
size     = 23
mappings = {
    "A","A","A","A","A","A","AE","C",
    "E","E","E","E","I","I","I", "I",
    "D","N","O","O","O","O","O"
}
start    = 0x00D8
size     = 6
mappings = {
    "O","U","U","U","U","Y"
}
start    = 0x00E0
size     = 23
mappings = {
    "a","a","a","a","a","a","ae","c",
    "e","e","e","e","i","i","i", "i",
    "d","n","o","o","o","o","o"
}
start    = 0x00F8
size     = 6
mappings = {
    "o","u","u","u","u","y"
}
: : :
```
The use of a sparse array will allow you to efficiently represent replacements even when they in widely spaced sections of the Unicode table. String replacements will allow arbitrary sequences to replace your diacritics (such as the æ grapheme becoming ae).

This is a language-agnostic answer so, if you have a specific language in mind, there will be better ways (although they'll all likely come down to this at the lowest levels anyway).
0 讨论(0)

查看其它12个回答
发布评论:

提交评论
- 加载中...