How to determine if a String contains invalid encoded characters

前端 未结 10 1339
眼角桃花
眼角桃花 2020-12-02 11:38

Usage scenario

We have implemented a webservice that our web frontend developers use (via a php api) internally to display product data. On the webs

10条回答
  •  一个人的身影
    2020-12-02 12:00

    I asked the same question,

    Handling Character Encoding in URI on Tomcat

    I recently found a solution and it works pretty well for me. You might want give it a try. Here is what you need to do,

    1. Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
    2. If you have to manually URL decode, use Latin1 as charset also.
    3. Use the fixEncoding() function to fix up encodings.

    For example, to get a parameter from query string,

      String name = fixEncoding(request.getParameter("name"));
    

    You can do this always. String with correct encoding is not changed.

    The code is attached. Good luck!

     public static String fixEncoding(String latin1) {
      try {
       byte[] bytes = latin1.getBytes("ISO-8859-1");
       if (!validUTF8(bytes))
        return latin1;   
       return new String(bytes, "UTF-8");  
      } catch (UnsupportedEncodingException e) {
       // Impossible, throw unchecked
       throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
      }
    
     }
    
     public static boolean validUTF8(byte[] input) {
      int i = 0;
      // Check for BOM
      if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
        && (input[1] & 0xFF) == 0xBB & (input[2] & 0xFF) == 0xBF) {
       i = 3;
      }
    
      int end;
      for (int j = input.length; i < j; ++i) {
       int octet = input[i];
       if ((octet & 0x80) == 0) {
        continue; // ASCII
       }
    
       // Check for UTF-8 leading byte
       if ((octet & 0xE0) == 0xC0) {
        end = i + 1;
       } else if ((octet & 0xF0) == 0xE0) {
        end = i + 2;
       } else if ((octet & 0xF8) == 0xF0) {
        end = i + 3;
       } else {
        // Java only supports BMP so 3 is max
        return false;
       }
    
       while (i < end) {
        i++;
        octet = input[i];
        if ((octet & 0xC0) != 0x80) {
         // Not a valid trailing byte
         return false;
        }
       }
      }
      return true;
     }
    

    EDIT: Your approach doesn't work for various reasons. When there are encoding errors, you can't count on what you are getting from Tomcat. Sometimes you get � or ?. Other times, you wouldn't get anything, getParameter() returns null. Say you can check for "?", what happens your query string contains valid "?" ?

    Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, browser may encode URL in either UTF-8 or Latin-1. User has no control. You need to accept both. Changing your servlet to Latin-1 will preserve all the characters, even if they are wrong, to give us a chance to fix it up or to throw it away.

    The solution I posted here is not perfect but it's the best one we found so far.

提交回复
热议问题