Implement a function to check if a string/byte array follows utf-8 format

后端 未结 5 1799
遥遥无期
遥遥无期 2020-12-16 00:20

I am trying to solve this interview question.

After given clearly definition of UTF-8 format. ex: 1-byte : 0b0xxxxxxx 2- bytes:.... Asked to wri

相关标签:
5条回答
  • 2020-12-16 00:50

    A small solution for real world UTF-8 compatibility checking:

    public static final boolean isUTF8(final byte[] inputBytes) {
        final String converted = new String(inputBytes, StandardCharsets.UTF_8);
        final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(inputBytes, outputBytes);
    }
    

    You can check the tests results:

    @Test
    public void testEnconding() {
    
        byte[] invalidUTF8Bytes1 = new byte[]{(byte)0b10001111, (byte)0b10111111 };
        byte[] invalidUTF8Bytes2 = new byte[]{(byte)0b10101010, (byte)0b00111111 };
        byte[] validUTF8Bytes1 = new byte[]{(byte)0b11001111, (byte)0b10111111 };
        byte[] validUTF8Bytes2 = new byte[]{(byte)0b11101111, (byte)0b10101010, (byte)0b10111111 };
    
        assertThat(isUTF8(invalidUTF8Bytes1)).isFalse();
        assertThat(isUTF8(invalidUTF8Bytes2)).isFalse();
        assertThat(isUTF8(validUTF8Bytes1)).isTrue();
        assertThat(isUTF8(validUTF8Bytes2)).isTrue();
        assertThat(isUTF8("\u24b6".getBytes(StandardCharsets.UTF_8))).isTrue();
    }
    

    Test cases copy from https://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array

    0 讨论(0)
  • 2020-12-16 00:57

    Well, I am grateful for the comments and the answer. First of all, I have to agree that this is "another stupid interview question". It is true that in Java String is already encoded, so it will always be compatible with UTF-8. One way to check it is given a string:

    public static boolean isUTF8(String s){
        try{
            byte[]bytes = s.getBytes("UTF-8");
        }catch(UnsupportedEncodingException e){
            e.printStackTrace();
            System.exit(-1);
        }
        return true;
    }
    

    However, since all the printable strings are in the unicode form, so I haven't got a chance to get an error.

    Second, if given a byte array, it will always be in the range -2^7(0b10000000) to 2^7(0b1111111), so it will always be in a valid UTF-8 range.

    My initial understanding to the question was that given a string, say "0b11111111", check if it is a valid UTF-8, I guess I was wrong.

    Moreover, Java does provide constructor to convert byte array to string, and if you are interested in the decode method, check here.

    One more thing, the above answer would be correct given another language. The only improvement could be:

    In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.

    So 4 bytes would be enough.

    I am definitely to this, so correct me if I am wrong. Thanks a lot.

    0 讨论(0)
  • 2020-12-16 00:57

    the CharsetDecoder might be what you are looking for:

    @Test
    public void testUTF8() throws CharacterCodingException {
        // the desired charset
        final Charset UTF8 = Charset.forName("UTF-8");
        // prepare decoder
        final CharsetDecoder decoder = UTF8.newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    
        byte[] bytes = new byte[48];
        new Random().nextBytes(bytes);
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        try {
            decoder.decode(buffer);
            fail("Should not be UTF-8");
        } catch (final CharacterCodingException e) {
            // noop, the test should fail here
        }
    
        final String string = "hallo welt!";
        bytes = string.getBytes(UTF8);
        buffer = ByteBuffer.wrap(bytes);
        final String result = decoder.decode(buffer).toString();
        assertEquals(string, result);
    }
    

    so your function might look like that:

    public static boolean checkEncoding(final byte[] bytes, final String encoding) {
        final CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
        final ByteBuffer buffer = ByteBuffer.wrap(bytes);
    
        try {
            decoder.decode(buffer);
            return true;
        } catch (final CharacterCodingException e) {
            return false;
        }
    }
    
    0 讨论(0)
  • 2020-12-16 01:00

    Let's first have a look at a visual representation of the UTF-8 design.

    enter image description here


    Now let's resume what we have to do.

    • Loop over all character of the string (each character being a byte).
    • We will need to apply a mask to each byte depending on the codepoint as the x characters represent the actual codepoint. We will use the binary AND operator (&) which copy a bit to the result if it exists in both operands.
    • The goal of applying a mask is to remove the trailing bits so we compare the actual byte as the first code point. We will do the bitwise operation using 0b1xxxxxxx where 1 will appear "Bytes in sequence" time, and other bits will be 0.
    • We can then compare with the first byte to verify if it is valid, and also determinate what is the actual byte.
    • If the character entered in none of the case, it means the byte is invalid and we return "No".
    • If we can get out of the loop, that means each of the character are valid, hence the string is valid.
    • Make sure the comparison that returned true correspond to the expected length.

    The method would look like this :

    public static final boolean isUTF8(final byte[] pText) {
    
        int expectedLength = 0;
    
        for (int i = 0; i < pText.length; i++) {
            if ((pText[i] & 0b10000000) == 0b00000000) {
                expectedLength = 1;
            } else if ((pText[i] & 0b11100000) == 0b11000000) {
                expectedLength = 2;
            } else if ((pText[i] & 0b11110000) == 0b11100000) {
                expectedLength = 3;
            } else if ((pText[i] & 0b11111000) == 0b11110000) {
                expectedLength = 4;
            } else if ((pText[i] & 0b11111100) == 0b11111000) {
                expectedLength = 5;
            } else if ((pText[i] & 0b11111110) == 0b11111100) {
                expectedLength = 6;
            } else {
                return false;
            }
    
            while (--expectedLength > 0) {
                if (++i >= pText.length) {
                    return false;
                }
                if ((pText[i] & 0b11000000) != 0b10000000) {
                    return false;
                }
            }
        }
    
        return true;
    }
    

    Edit : The actual method is not the original one (almost, but not) and is stolen from here. The original one was not properly working as per @EJP comment.

    0 讨论(0)
  • 2020-12-16 01:01
    public static boolean validUTF8(byte[] input) {
        int i = 0;
        // Check for BOM
        if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
                && (input[1] & 0xFF) == 0xBB & (input[2] & 0xFF) == 0xBF) {
            i = 3;
        }
    
        int end;
        for (int j = input.length; i < j; ++i) {
            int octet = input[i];
            if ((octet & 0x80) == 0) {
                continue; // ASCII
            }
    
            // Check for UTF-8 leading byte
            if ((octet & 0xE0) == 0xC0) {
                end = i + 1;
            } else if ((octet & 0xF0) == 0xE0) {
                end = i + 2;
            } else if ((octet & 0xF8) == 0xF0) {
                end = i + 3;
            } else {
                // Java only supports BMP so 3 is max
                return false;
            }
    
            while (i < end) {
                i++;
                octet = input[i];
                if ((octet & 0xC0) != 0x80) {
                    // Not a valid trailing byte
                    return false;
                }
            }
        }
        return true;
    }
    
    0 讨论(0)
提交回复
热议问题