Converting from Windows 1252 to UTF8 in Java: null characters with CharsetDecoder/Encoder

后端未结

关注

 2  2262

I know it\'s a very general question but I\'m becoming mad.

I used this code:

String ucs2Content = new String(bufferToConvert, inputEncoding);


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  死守一世寂寞        
                
              
                            
                2021-02-20 06:25
              
            
            
                                                                       
I am not sure how you get a sequence of null characters.  Try this

String outputEncoding = "UTF-8";
Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();

// Convert the byte array from starting inputEncoding into UCS2
byte[] bufferToConvert = "Hello World! £€".getBytes();
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));

// Convert the internal UCS2 representation into outputEncoding
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
System.out.println(new String(bbuf.array(), 0, bbuf.limit(), charsetOutput));


prints

Hello World! £€

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  攒了一身酷        
                
              
                            
                2021-02-20 06:30
              
            
            
                                                                       
Your problem is that ByteBuffer.array() returns a direct reference to the array used as backing store for the ByteBuffer and not a copy of the backing array's valid range. You have to obey bbuf.limit() (as Peter did in his response) and just use the array content from index 0 to bbuf.limit()-1.

The reason for the extra 0 values in the backing array is a slight flaw in how the resulting ByteBuffer is created by the CharsetEncoder. Each CharsetEncoder has an "average bytes per character", which for the UCS2 encoder seem to be simple and correct (2 bytes/char). Obeying this fixed value, the CharsetEncoder initially allocates a ByteBuffer with "string length * average bytes per character" bytes, in this case e.g. 20 bytes for a 10 character long string. The UCS2 CharsetEncoder starts however with a BOM (byte order mark), which also occupies 2 bytes, so that only 9 of the 10 characters fit in the allocated ByteBuffer. The CharsetEncoder detects the overflow and allocates a new ByteBuffer with a length of 2*n+1 (n being the original length of the ByteBuffer), in this case 2*20+1 = 41 bytes. Since only 2 of the 21 new bytes are required to encode the remaining character, the array you get from bbuf.array() will have a length of 41 bytes, but bbuf.limit() will indicate that only the first 22 entries are actually used.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复