Java Strings Character Encoding - For French - Dutch Locales

执念已碎 2021-01-13 16:26

I have the following piece of code

public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(Charset.defaultCha


      
      
        
3 answers

刺人心 (original poster) 2021-01-13 17:18

When you call String's no-argument getBytes() method, it:


  Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.


So whenever you do:

accentedE.getBytes()


it returns the contents of accentedE as bytes encoded in the platform's default charset, in your case cp-1252.
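
To make the dependence on the default charset visible, here is a minimal sketch. The class name DefaultCharsetDemo and the hex helper are made up for illustration, and it assumes the JVM ships the windows-1252 charset, which common JDK builds do but the specification does not guarantee.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Hypothetical helper: formats bytes as space-separated uppercase hex, e.g. "C3 A9".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // é written as a Unicode escape, so the source file's own encoding does not matter.
        String accentedE = "\u00E9";

        // No-argument getBytes(): the result depends on the platform default charset.
        System.out.println(Charset.defaultCharset() + " -> " + hex(accentedE.getBytes()));

        // Explicit charsets: the result is the same on every machine.
        System.out.println("UTF-8 -> " + hex(accentedE.getBytes(StandardCharsets.UTF_8)));
        System.out.println("windows-1252 -> "
                + hex(accentedE.getBytes(Charset.forName("windows-1252"))));
    }
}
```

Only the first line of output should vary between machines; passing a charset explicitly removes that variation.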

This line:

new String(accentedE.getBytes(), Charset.forName("UTF-8"))


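takes the accentedE bytes (encoded in cp-1252) and tries to decode them as UTF-8. The single byte E9 is not a valid UTF-8 sequence on its own, so the decoder substitutes the replacement character (U+FFFD), hence the error. The same thing happens from the other side with: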
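
A small sketch of this first mismatch, with the cp-1252 charset named explicitly so that it behaves the same regardless of the machine's default. The class and method names are hypothetical, and windows-1252 is assumed to be available in the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MismatchedDecodeDemo {
    // Encodes s as windows-1252, then (incorrectly) decodes those bytes as UTF-8.
    public static String encodeCp1252DecodeUtf8(String s) {
        byte[] cp1252Bytes = s.getBytes(Charset.forName("windows-1252"));
        return new String(cp1252Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String accentedE = "\u00E9"; // é
        String broken = encodeCp1252DecodeUtf8(accentedE);

        // The lone byte E9 is malformed UTF-8, so the decoder emits U+FFFD instead of é.
        System.out.println(String.format("U+%04X", broken.codePointAt(0)));
    }
}
```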

new String(accentedE.getBytes("utf-8"))


Here getBytes("utf-8") encodes accentedE as UTF-8, but the single-argument String constructor then decodes those bytes with the platform's default charset, which is cp-1252, so the two UTF-8 bytes come out as two separate characters (Ã©). From the constructor's documentation:


  Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.
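
The following sketch reproduces this second mismatch and shows one way to avoid it: decode with the same charset that was used to encode. Charsets are passed explicitly so the result does not depend on the machine; the class and method names are illustrative only, and windows-1252 is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8AsCp1252Demo {
    // Encodes s as UTF-8, then (incorrectly) decodes those bytes as windows-1252.
    public static String encodeUtf8DecodeCp1252(String s) {
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    // Encodes and decodes with the same charset, so the text survives unchanged.
    public static String roundTripUtf8(String s) {
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String accentedE = "\u00E9"; // é

        // Bytes C3 A9 read as windows-1252 become two characters: Ã (U+00C3) and © (U+00A9).
        System.out.println(encodeUtf8DecodeCp1252(accentedE).length());

        // Matching charsets on both sides give back the original string.
        System.out.println(roundTripUtf8(accentedE).equals(accentedE));
    }
}
```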


I strongly recommend reading this excellent article:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky

UPDATE:

In short, every character is stored as a number, and a charset (code page) defines which byte values stand for which character. Consider the following snippet:

String accentedE = "é";

System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[0]));
System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[1]));
System.out.println(String.format("%02X ", accentedE.getBytes("windows-1252")[0]));


which outputs:

C3 
A9 
E9


That is because the small accented e is stored in UTF-8 as the two bytes C3 A9, while in cp-1252 it is stored as the single byte E9. For a detailed explanation, read the linked article.
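
It can help to separate the character's number (its Unicode code point) from the bytes a given charset uses to store it. The sketch below, with hypothetical class and method names and assuming windows-1252 is available, prints the code point once and then the encoded length under several charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointVsBytes {
    // The Unicode code point of the first character; independent of any charset.
    public static int codePoint(String s) {
        return s.codePointAt(0);
    }

    // How many bytes the string occupies once encoded with the given charset.
    public static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String accentedE = "\u00E9"; // é

        System.out.println(String.format("code point: U+%04X", codePoint(accentedE)));
        System.out.println("UTF-8 bytes: " + encodedLength(accentedE, StandardCharsets.UTF_8));
        System.out.println("windows-1252 bytes: "
                + encodedLength(accentedE, Charset.forName("windows-1252")));
        System.out.println("UTF-16BE bytes: " + encodedLength(accentedE, StandardCharsets.UTF_16BE));
    }
}
```

The code point stays the same in every case; only the byte representation changes with the charset.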
    
             
                                                        
            
            
              
                