Strange UTF-8 decoding error in Windows Notepad

Question

If you type the following string into a text file encoded as UTF-8 (without BOM) and open it with notepad.exe, you will get some weird characters on screen. But Notepad decodes the string correctly if the last 'a' is removed. Very strange behavior. I am using Windows 10 1809.

[19, 16, 12, 14, 15, 15, 12, 17, 18, 15, 14, 15, 19, 13, 20, 18, 16, 19, 14, 16, 20, 16, 18, 12, 13, 14, 15, 20, 19, 17, 14, 17, 18, 16, 13, 12, 17, 14, 16, 13, 13, 12, 15, 20, 19, 15, 19, 13, 18, 19, 17, 14, 17, 18, 12, 15, 18, 12, 19, 15, 12, 19, 18, 12, 17, 20, 14, 16, 17, 18, 15, 12, 13, 19, 18, 17, 18, 14, 19, 18, 16, 15, 18, 17, 15, 15, 19, 16, 15, 14, 19, 13, 19, 15, 17, 16, 12, 12, 18, 12, 14, 12, 16, 19, 12, 19, 12, 17, 19, 20, 19, 17, 19, 20, 16, 19, 16, 19, 16, 12, 12, 18, 19, 17, 18, 16, 12, 17, 13, 18, 20, 19, 18, 20, 14, 16, 13, 12, 12, 14, 13, 19, 17, 20, 18, 15, 12, 15, 20, 14, 16, 15, 16, 19, 20, 20, 12, 17, 13, 20, 16, 20, 13a

I wonder whether this is a Windows bug or whether there is something I can do to solve it.

Answer 1:

Did more research; figured it out.

Seems like a variation of the classic case of "Bush hid the facts". https://en.wikipedia.org/wiki/Bush_hid_the_facts

It looks like Notepad has a different character encoding default for saving a file than it does for opening a file. Yes, this does seem like a bug.

But there is an actual explanation for what is occurring:

Notepad checks for a BOM byte sequence. If it does not find one, it has two options: the encoding is either UTF-16 Little Endian (without BOM) or plain ASCII. It checks for UTF-16 LE first, using a function called IsTextUnicode.
IsTextUnicode runs a series of tests to guess whether the given text is Unicode or not. One of these tests is IS_TEXT_UNICODE_STATISTICS, which uses statistical analysis. If the test is true, then the given text is probably Unicode, but absolute certainty is not guaranteed.
https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-istextunicode
If IsTextUnicode returns true, Notepad decodes the file as UTF-16 LE, producing the strange output you saw. We can confirm this with one of the characters that shows up, ㄠ. It comes from the two ASCII characters ' 1' (space, one), whose byte values are 0x20 for the space and 0x31 for the one. Since the byte order is Little Endian, the second byte becomes the high byte of the 16-bit code unit, so the pair is read as '1 ', that is 0x3120 or U+3120, which you can confirm if you look up that code point.
https://unicode-table.com/en/3120/
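
The byte arithmetic above can be checked with a short Python sketch. It is only an illustration of how two ASCII bytes turn into one UTF-16 LE code unit; it does not call IsTextUnicode or reproduce Notepad's detection logic, and the variable names are made up for the example.

```python
# Bytes as they sit in the file: ASCII space (0x20) followed by ASCII '1' (0x31).
raw = b" 1"

# Read as UTF-16 LE, the first byte is the low half and the second byte the
# high half, so the two bytes combine into the single code unit 0x3120.
misread = raw.decode("utf-16-le")

print(misread, hex(ord(misread)))  # ㄠ 0x3120

# The same two bytes decoded as UTF-8 give back the original characters.
print(repr(raw.decode("utf-8")))   # ' 1'
```

Every even-length run of ASCII bytes can be reinterpreted this way, which is why a wrong UTF-16 LE guess turns the whole file into CJK-looking characters.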

If you want to solve the issue, you need to break the pattern that leads IsTextUnicode to decide the text is Unicode. For example, you can insert a newline before the text.
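
If the file is generated by a script, the newline workaround could be applied before writing it. The helper below is a hypothetical sketch (the name `break_pattern` and the choice of a Windows-style `\r\n` are assumptions, not something from the question). It only prepends the newline; whether Notepad then guesses correctly still has to be checked by opening the file.

```python
def break_pattern(data: bytes) -> bytes:
    """Return the bytes with a leading newline, unless one is already there."""
    if data.startswith((b"\n", b"\r\n")):
        return data
    # Assumption: a Windows-style line ending is used for the leading newline.
    return b"\r\n" + data


# A short stand-in for the problematic string, encoded as UTF-8 without BOM.
sample = "19, 16, 12a".encode("utf-8")
patched = break_pattern(sample)

print(patched)  # b'\r\n19, 16, 12a'
```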

Hope that helped!

Source: https://stackoverflow.com/questions/55690349/strange-utf8-decoding-error-in-windows-notepad

Tags

windows

utf-8

character-encoding

notepad