How to determine if a String contains invalid encoded characters

眼角桃花 2020-12-02 11:38

Usage scenario

We have implemented a web service that our web frontend developers use internally (via a PHP API) to display product data. On the webs

10 answers
  •  执念已碎
    2020-12-02 12:03

    I've been working on a similar "guess the encoding" problem. The best solution involves knowing the encoding. Barring that, you can make educated guesses to distinguish between UTF-8 and ISO-8859-1.

    To answer the general question of how to detect if a string is properly encoded UTF-8, you can verify the following things:

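    1. No byte is 0x00, 0xC0, 0xC1, or in the range 0xF5-0xFF. (Strictly speaking, 0x00 is the valid UTF-8 encoding of U+0000, but it rarely appears in real text, so treating it as a red flag is reasonable.)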
    2. Tail bytes (0x80-0xBF) are always preceded by a head byte 0xC2-0xF4 or another tail byte.
    3. Head bytes should correctly predict the number of tail bytes (e.g., any byte in 0xC2-0xDF should be followed by exactly one byte in the range 0x80-0xBF).

    If a string passes all those tests, then it's interpretable as valid UTF-8. That doesn't guarantee that it is UTF-8, but it's a good predictor.
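
    The three checks above can be sketched in Python roughly as follows. This is only an illustration: the function name `looks_like_utf8` is hypothetical, and the tail counts for the 0xE0-0xEF and 0xF0-0xF4 head bytes (two and three) are filled in from the UTF-8 format rather than from the list itself.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Apply the three byte-level checks to raw bytes (hypothetical helper)."""
    expected_tails = 0  # tail bytes still owed by the most recent head byte
    for b in data:
        # Check 1: bytes that should never occur (0x00 is treated as suspect)
        if b in (0x00, 0xC0, 0xC1) or b >= 0xF5:
            return False
        if 0x80 <= b <= 0xBF:
            # Check 2: a tail byte must follow a head byte or another tail
            if expected_tails == 0:
                return False
            expected_tails -= 1
        else:
            # Check 3: an ASCII or head byte may not interrupt a sequence
            if expected_tails != 0:
                return False
            if 0xC2 <= b <= 0xDF:
                expected_tails = 1
            elif 0xE0 <= b <= 0xEF:
                expected_tails = 2
            elif 0xF0 <= b <= 0xF4:
                expected_tails = 3
    # A sequence that is cut off at the end of the input is also invalid
    return expected_tails == 0


print(looks_like_utf8("héllo".encode("utf-8")))    # UTF-8 bytes
print(looks_like_utf8("héllo".encode("latin-1")))  # ISO-8859-1 bytes
```

    Note that these checks are slightly looser than a full validator: for instance, the bytes 0xE0 0x80 0x80 pass them, but Python's strict decoder (`data.decode("utf-8")`, which raises `UnicodeDecodeError`) rejects that sequence as an overlong encoding. If a strict answer is needed, relying on the decoder is probably the safer choice.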

    Legal ISO-8859-1 input will likely contain no control characters (0x00-0x1F and 0x80-0x9F) other than line separators. 0x7F (DEL) does not appear to be a defined printable character in ISO-8859-1 either, so it is also suspect.
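
    As a rough sketch of that heuristic, the snippet below counts the bytes that would be unexpected control characters if the input were ISO-8859-1. The name `latin1_control_count` is hypothetical, and allowing tab alongside CR and LF is my assumption, not something stated above.

```python
def latin1_control_count(data: bytes) -> int:
    """Count bytes that would be unusual control characters in ISO-8859-1 text."""
    allowed = {0x09, 0x0A, 0x0D}  # tab (assumed acceptable), LF, CR
    count = 0
    for b in data:
        is_c0 = b <= 0x1F and b not in allowed  # 0x00-0x1F minus allowed ones
        is_del_or_c1 = 0x7F <= b <= 0x9F        # 0x7F plus the 0x80-0x9F range
        if is_c0 or is_del_or_c1:
            count += 1
    return count


print(latin1_control_count("café\n".encode("latin-1")))
print(latin1_control_count("\u2013".encode("utf-8")))  # en dash as UTF-8 bytes
```

    A count of zero is not proof of ISO-8859-1. The UTF-8 encoding of "é" (0xC3 0xA9), for example, consists of two bytes that are both printable in ISO-8859-1, so the count is best used as one signal among several.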

    (I'm basing this off of Wikipedia pages for UTF-8 and ISO-8859-1.)
