Converting combining diacritics to simple utf

Posted by 試著忘記壹切 on 2019-12-08 20:21:18

Question


I have a problem inserting a string into a database because of an encoding issue.

The string comes from an external RSS feed. In a web browser it looks fine. Even in the debugger the text appears fine. If I copy the string to Notepad, the result is also fine.

But in Notepad++ I could see that the string uses combining characters. After switching the encoding to ANSI, the base letter and the combining mark show up separately, e.g.

á is displayed as a´

(In Notepad++ it is like having two characters, one on top of the other. I can even select ... half of the character.)
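
For illustration, here is a minimal C# sketch of the two representations. C# is an assumption (the question does not name a language; the first answer uses .NET), and so is the combining mark being U+0301. `CodePointDump` and `Dump` are hypothetical names used only for this example.

```csharp
using System;
using System.Linq;
using System.Text;

public static class CodePointDump
{
    // Lists each UTF-16 code unit of the string as U+XXXX.
    public static string Dump(string s) =>
        string.Join(" ", s.Select(c => $"U+{(int)c:X4}"));

    public static void Main()
    {
        string decomposed = "a\u0301";  // "a" followed by COMBINING ACUTE ACCENT
        string precomposed = "\u00E1";  // single code point, LATIN SMALL LETTER A WITH ACUTE

        Console.WriteLine(Dump(decomposed));           // U+0061 U+0301
        Console.WriteLine(Dump(precomposed));          // U+00E1
        Console.WriteLine(decomposed == precomposed);  // False (ordinal comparison)
        Console.WriteLine(decomposed.IsNormalized(NormalizationForm.FormC)); // False
    }
}
```

Both strings render as the same glyph, which is probably why the browser, the debugger and Notepad all look fine while the stored data differs.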

I googled a lot and tried many different approaches to this problem. I really want to find a clean way to convert a string with combining diacritics into simple, database-compatible UTF-8.

Any help? Thank you so much!


Answer 1:


This should work for you (Normalize returns a new string, so use its return value):

output.Normalize(NormalizationForm.FormC)

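This little test printed 3, 2, 3. In the middle case, A and its diacritic are correctly combined into a single code point (Â, U+00C2), which takes two bytes in UTF-8. T has no precomposed form with a circumflex, so the third string stays at three bytes.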

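// Requires: using System; using System.Text;
Console.WriteLine(Encoding.UTF8.GetByteCount("A\u0302"));                                     // 3: "A" plus combining circumflex
Console.WriteLine(Encoding.UTF8.GetByteCount("A\u0302".Normalize(NormalizationForm.FormC)));  // 2: composed to U+00C2
Console.WriteLine(Encoding.UTF8.GetByteCount("T\u0302".Normalize(NormalizationForm.FormC)));  // 3: no precomposed form exists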
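
A minimal sketch of how this could be applied to feed text before it is written to the database. `FeedText` and `ToComposed` are hypothetical names, and the sample value is made up for illustration; the actual insert code is not shown in the question, so it is left out here.

```csharp
using System;
using System.Text;

public static class FeedText
{
    // Hypothetical helper: returns the text in Normalization Form C (NFC).
    public static string ToComposed(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        // Strings are immutable: Normalize returns a new string and
        // leaves the original untouched.
        return text.IsNormalized(NormalizationForm.FormC)
            ? text
            : text.Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        string fromFeed = "Joa\u0303o";             // "a" followed by COMBINING TILDE
        string forDatabase = ToComposed(fromFeed);  // value to pass to the insert

        Console.WriteLine(fromFeed.Length);     // 5
        Console.WriteLine(forDatabase.Length);  // 4
    }
}
```

Sequences without a precomposed equivalent, such as T with a circumflex, remain decomposed after normalization, so the database column still has to accept combining marks.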



Answer 2:


On my Mac, running the following command in Terminal solves this:

iconv -f utf-8-mac -t utf-8 inputfile >outputfile
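
The `utf-8-mac` encoding name may not be available in iconv builds on other systems. As a rough counterpart, assuming the input file is plain UTF-8 text and that composing to NFC is the desired result, a C# sketch could look like this (`NormalizeFile` and `ConvertToNfc` are hypothetical names, and the output is not guaranteed to match iconv byte for byte):

```csharp
using System;
using System.IO;
using System.Text;

public static class NormalizeFile
{
    // Reads a UTF-8 file, composes combining sequences (NFC),
    // and writes the result as UTF-8 without a byte order mark.
    public static void ConvertToNfc(string inputPath, string outputPath)
    {
        string text = File.ReadAllText(inputPath, Encoding.UTF8);
        string composed = text.Normalize(NormalizationForm.FormC);
        File.WriteAllText(outputPath, composed, new UTF8Encoding(false));
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: NormalizeFile <inputfile> <outputfile>");
            return;
        }
        ConvertToNfc(args[0], args[1]);
    }
}
```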



Source: https://stackoverflow.com/questions/20889305/converting-combining-diacritics-to-simple-utf
