How to check the charset of a string in Java?

梦谈多话 2020-12-07 00:29

In my application I'm getting the user info from LDAP, and sometimes the full username comes back in the wrong charset. For example:

ТеÑÑ61 ТеÑÑовиÑ61


        
5 answers
  • 2020-12-07 00:50

    I had the same problem. Tika is too large and juniversalchardet does not detect ISO-8859-1, so I wrote my own check, and it is now working well in production:

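    // Requires: import java.nio.charset.Charset; import java.nio.charset.StandardCharsets;

    // Encodes value with fromEncoding and decodes the resulting bytes with toEncoding.
    public String convert(String value, Charset fromEncoding, Charset toEncoding) {
      return new String(value.getBytes(fromEncoding), toEncoding);
    }

    // Returns the first candidate charset for which value survives a round trip through UTF-8.
    // This is a heuristic: if no candidate matches, it falls back to UTF-8.
    public String charset(String value, String[] charsets) {
      Charset probe = StandardCharsets.UTF_8;
      for (String c : charsets) {
        // Charset.forName never returns null; it throws if the name is illegal or unsupported.
        Charset charset = Charset.forName(c);
        if (value.equals(convert(convert(value, charset, probe), probe, charset))) {
          return c;
        }
      }
      return StandardCharsets.UTF_8.name();
    }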
    

    Full description here: Detect the charset in Java strings.
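
    To make the round-trip idea above concrete, here is a minimal, self-contained sketch. The class name RoundTripProbe, the method survivesRoundTrip and the sample text are hypothetical and not part of the original answer; the garbled value is produced in code by decoding UTF-8 bytes as ISO-8859-1, which is assumed to be the kind of damage shown in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the round-trip check; not the answer's exact code.
final class RoundTripProbe {

    // True if value, encoded with candidate and pushed through UTF-8 and back, is unchanged.
    static boolean survivesRoundTrip(String value, Charset candidate) {
        String viaUtf8 = new String(value.getBytes(candidate), StandardCharsets.UTF_8);
        String back = new String(viaUtf8.getBytes(StandardCharsets.UTF_8), candidate);
        return value.equals(back);
    }

    // Undoes "UTF-8 bytes decoded with the wrong charset" by reversing the wrong decode.
    static String repair(String garbled, Charset wronglyUsed) {
        return new String(garbled.getBytes(wronglyUsed), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u0422\u0435\u0441\u0442"; // Cyrillic sample text
        // Simulate the bug: UTF-8 bytes read as ISO-8859-1.
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);

        System.out.println(survivesRoundTrip(garbled, StandardCharsets.ISO_8859_1));  // true
        System.out.println(survivesRoundTrip(original, StandardCharsets.ISO_8859_1)); // false
        System.out.println(repair(garbled, StandardCharsets.ISO_8859_1).equals(original)); // true
    }
}
```

    Note that the check is only a hint: a pure ASCII string passes for every ASCII-compatible candidate, and UTF-8 as a candidate passes for any well-formed string, so the order of the candidate list matters.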

  • 2020-12-07 00:52

    Your LDAP database is set up incorrectly. The application putting data into it should convert to a known character set encoding, in your case, likely UTF_16. Pick a standard. All methods of detecting encoding are guesses.

    The application writing the value is the only one that knows definitively which encoding it is using and can properly convert to another encoding such as UTF_16.
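
    As a rough illustration of "pick a standard", the sketch below names the charset explicitly on both the writing and the reading side instead of relying on the platform default. The class ExplicitCharset and its methods are hypothetical, and UTF_16 is used only because it is the encoding suggested above; any agreed charset works the same way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: writer and reader share one agreed charset constant.
final class ExplicitCharset {

    // The single agreed-upon encoding (an assumption for this sketch).
    static final Charset AGREED = StandardCharsets.UTF_16;

    // Writing side: always encode with the agreed charset.
    static byte[] encode(String text) {
        return text.getBytes(AGREED);
    }

    // Reading side: always decode with the same charset.
    static String decode(byte[] raw) {
        return new String(raw, AGREED);
    }

    public static void main(String[] args) {
        String name = "\u0422\u0435\u0441\u0442"; // Cyrillic sample text
        byte[] stored = encode(name);
        System.out.println(decode(stored).equals(name)); // true
        // Decoding with a different charset than the writer used corrupts the text.
        System.out.println(new String(stored, StandardCharsets.ISO_8859_1).equals(name)); // false
    }
}
```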

  • 2020-12-07 00:59

    Strings in Java, AFAIK, do not retain their original encoding - they are always stored internally in some Unicode form. What you want is to detect the charset of the original stream/bytes, which is why I think a String.getBytes() call comes too late.

    Ideally, if you can get hold of the input stream you are reading from, run it through something like this: http://code.google.com/p/juniversalchardet/

    There are plenty of other charset detectors out there as well.
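
    To show why the raw bytes matter, here is a small JDK-only sketch that checks whether a byte array is valid in a given charset by decoding it strictly. It is not juniversalchardet and does no statistical detection; the class StrictDecode and the method decodeOrNull are hypothetical names used only for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict decoding of raw bytes, before any String is created.
final class StrictDecode {

    // Returns the decoded text, or null if the bytes are not valid in the given charset.
    static String decodeOrNull(byte[] raw, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // e-acute encoded as ISO-8859-1
        System.out.println(decodeOrNull(latin1, StandardCharsets.UTF_8) == null);   // true: not valid UTF-8
        System.out.println("\u00e9".equals(decodeOrNull(latin1, StandardCharsets.ISO_8859_1))); // true
    }
}
```

    A strict UTF-8 check like this can rule UTF-8 in or out fairly reliably, but single-byte charsets such as ISO-8859-1 accept any byte sequence, so telling them apart still needs a statistical detector.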

  • 2020-12-07 01:00

    I recommend Apache Tika's CharsetDetector; it is easy to use and robust.

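    // Requires: import org.apache.tika.parser.txt.CharsetDetector;
    //           import org.apache.tika.parser.txt.CharsetMatch;
    CharsetDetector detector = new CharsetDetector();
    detector.setText(yourStr.getBytes());  // getBytes() uses the platform default charset; prefer the original raw bytes if you have them
    CharsetMatch match = detector.detect();  // best guess; check it with match.getName()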
    

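    Further, you can have the detector decode the bytes into a Java String directly. The second argument is the declared encoding, which the detector treats as a hint; take UTF-8 as an example: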

    detector.getString(yourStr.getBytes(), "utf-8");
    
  • 2020-12-07 01:07

    In your web application, you can declare an encoding filter in web.xml to make sure that request data is read in the right encoding.

    <filter>
        <description>Explicitly set the encoding of the page to UTF-8</description>
        <filter-name>encodingFilter</filter-name>
        <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
        <init-param>
            <param-name>encoding</param-name>
            <param-value>UTF-8</param-value>
        </init-param>
        <init-param>
            <param-name>forceEncoding</param-name>
            <param-value>true</param-value>
        </init-param>
    </filter>
    

    This Spring-provided filter makes sure that the controllers/servlets receive request parameters decoded as UTF-8. It needs a matching <filter-mapping> entry in web.xml to take effect.
