How can I generate a non-UTF-8 Character Set

Deadly, submitted on 2020-01-02 00:50:31

Question


One of my requirements says "Text Box Name should accept only the UTF-8 character set". I want to perform a negative test by entering non-UTF-8 characters. How can I do this?


Answer 1:


If you are asking how to construct a non-UTF-8 character, that should be easy from this definition from Wikipedia:

For code points U+0000 through U+007F, each code point is one byte long and looks like this:

0xxxxxxx   // a

For code points U+0080 through U+07FF, each code point is two bytes long and looks like this:

110xxxxx 10xxxxxx  // b

And so on.
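As a rough illustration of these patterns, the sketch below classifies a single byte by its leading bits. The class name `Utf8ByteKind`, the method `kind`, and the labels it returns are made up for this example. The three- and four-byte lead patterns (1110xxxx and 11110xxx) come from the UTF-8 definition rather than from the excerpt above. The check looks at the bit pattern only; it does not reject overlong lead bytes such as 0xC0 or lead bytes beyond the Unicode range.

```java
public class Utf8ByteKind {
    // Classifies one byte by its leading bits (pattern check only).
    static String kind(int b) {
        b &= 0xFF; // treat the value as an unsigned byte
        if ((b & 0x80) == 0x00) return "single-byte";   // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return "continuation";  // 10xxxxxx
        if ((b & 0xE0) == 0xC0) return "lead-2";        // 110xxxxx
        if ((b & 0xF0) == 0xE0) return "lead-3";        // 1110xxxx
        if ((b & 0xF8) == 0xF0) return "lead-4";        // 11110xxx
        return "never-valid";                           // 11111xxx
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0x80, 0xC2, 0xE2, 0xF0, 0xFF};
        for (int b : samples) {
            System.out.println(String.format("0x%02X", b) + " -> " + kind(b));
        }
    }
}
```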

So, to construct an illegal UTF-8 character that is one byte long, the highest bit must be 1 (to be different from pattern a) and the second highest bit must be 0 (to be different from pattern b):

10xxxxxx

or

111xxxxx

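This also differs from both patterns. Note that 1110xxxx and 11110xxx are valid lead bytes of three- and four-byte sequences, so such a byte is illegal only when it is not followed by the right number of 10xxxxxx continuation bytes. A byte of the form 11111xxx is never valid.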

With the same logic, you can construct illegal code unit sequences that are two or more bytes long.
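For a negative test it may help to confirm that a candidate byte sequence really is rejected by a strict decoder. The sketch below is one possible way to do that with the JDK's `CharsetDecoder`, configured to report errors instead of substituting a replacement character. The class name `Utf8Validator`, the method `isValidUtf8`, and the sample byte arrays are chosen for this example only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Validator {
    // Returns true only if the whole array decodes as UTF-8 without errors.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed or truncated input
        }
    }

    public static void main(String[] args) {
        byte[] ascii = {0x41};                              // pattern a
        byte[] twoByte = {(byte) 0xC2, (byte) 0x80};        // pattern b, U+0080
        byte[] loneContinuation = {(byte) 0x80};            // 10xxxxxx on its own
        byte[] truncatedLead = {(byte) 0xE2, (byte) 0x82};  // 3-byte lead, one byte missing
        byte[] neverValid = {(byte) 0xFF};                  // 11111xxx

        System.out.println("ascii: " + isValidUtf8(ascii));
        System.out.println("twoByte: " + isValidUtf8(twoByte));
        System.out.println("loneContinuation: " + isValidUtf8(loneContinuation));
        System.out.println("truncatedLead: " + isValidUtf8(truncatedLead));
        System.out.println("neverValid: " + isValidUtf8(neverValid));
    }
}
```

The invalid arrays are the kind of input a negative test could send to the component under test, provided the test harness can deliver raw bytes rather than already-decoded strings.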

You did not tag a language, but I had to test it, so I used Java:

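// requires: import java.nio.charset.StandardCharsets;
for (int i = 0; i < 256; i++) { // all 256 possible byte values
    System.out.println(
        i + " " +                        // unsigned value
        (byte) i + " " +                 // same bits as a signed Java byte
        Integer.toHexString(i) + " " +   // hexadecimal
        String.format("%8s", Integer.toBinaryString(i)).replace(' ', '0') + " " + // 8-bit binary
        // decode the single byte as UTF-8; invalid input becomes U+FFFD
        new String(new byte[]{(byte) i}, StandardCharsets.UTF_8)
    );
}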

0 to 31 are non-printable characters, then 32 is space, followed by printable characters:

...
31 31 1f 00011111 
32 32 20 00100000  
33 33 21 00100001 !
...
126 126 7e 01111110 ~
127 127 7f 01111111 
128 -128 80 10000000 �

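Delete is 0x7F. After it, from 128 (0x80) upward, no valid characters are printed: each of these bytes is decoded to the replacement character U+FFFD (�), because no byte with the high bit set is a valid UTF-8 sequence on its own. You can also see this in the UTF-8 code chart: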

Code point U+007F is represented with one byte 0x7F (bits 01111111), while code point U+0080 is represented with two bytes 0xC2 0x80 (bits 11000010 10000000).
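The boundary between the one-byte and two-byte forms can be checked directly by encoding the code points and printing the resulting bytes. This is a small sketch; the class name `Utf8Boundary` and the helper `utf8Hex` are invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Boundary {
    // Encodes one code point as UTF-8 and returns the bytes as space-separated hex.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("U+007F -> " + utf8Hex(0x7F));  // 7F
        System.out.println("U+0080 -> " + utf8Hex(0x80));  // C2 80
        System.out.println("U+07FF -> " + utf8Hex(0x7FF)); // DF BF
    }
}
```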

If you are not familiar with UTF-8 I strongly recommend reading this excellent article:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)



Source: https://stackoverflow.com/questions/16031620/how-can-i-generate-a-non-utf-8-character-set
