发表新帖

发表新帖

How to determine if a String contains invalid encoded characters

前端未结

关注

 10  1332

眼角桃花 2020-12-02 11:38

Usage scenario

We have implemented a webservice that our web frontend developers use (via a php api) internally to display product data. On the webs

10条回答

执念已碎 (楼主)

2020-12-02 12:03
I've been working on a similar "guess the encoding" problem. The best solution involves knowing the encoding. Barring that, you can make educated guesses to distinguish between UTF-8 and ISO-8859-1.

To answer the general question of how to detect if a string is properly encoded UTF-8, you can verify the following things:
1. No byte is 0x00, 0xC0, 0xC1, or in the range 0xF5-0xFF.
2. Tail bytes (0x80-0xBF) are always preceded by a head byte 0xC2-0xF4 or another tail byte.
3. Head bytes should correctly predict the number of tail bytes (e.g., any byte in 0xC2-0xDF should be followed by exactly one byte in the range 0x80-0xBF).
If a string passes all those tests, then it's interpretable as valid UTF-8. That doesn't guarantee that it is UTF-8, but it's a good predictor.

Legal input in ISO-8859-1 will likely have no control characters (0x00-0x1F and 0x80-0x9F) other than line separators. Looks like 0x7F isn't defined in ISO-8859-1 either.

(I'm basing this off of Wikipedia pages for UTF-8 and ISO-8859-1.)
0 讨论(0)

查看其它10个回答
发布评论:

提交评论
- 加载中...

热议问题