Why do I get an “Invalid Byte Sequence in UTF-8” error reading a text file?


刺人心 2021-01-26 18:26

I'm writing a Ruby script to process a large text file, and keep getting an odd encoding error. Here's the situation:

input_data = File.new(in_path, 'r').rea


      
      
        
5 Answers

青春惊慌失措 (original poster) 2021-01-26 19:16

Here are 2 common situations and how to deal with them:


Situation 1

You have a UTF-8 input file that may contain a few invalid bytes

Remove the invalid bytes:

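# Create a sample file containing one stray byte (0xE4) that is not valid UTF-8
test = "Partly valid\xE4 UTF-8 encoding: äöüß"
File.open('input_file', 'w') { |f| f.write(test) }

str = File.read('input_file')

# Drop every byte that is not part of a valid UTF-8 sequence
# (String#scrub requires Ruby 2.1 or later)
str.scrub('')
# => "Partly valid UTF-8 encoding: äöüß"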
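
Since the question mentions a large file, the same idea can be applied line by line instead of reading everything into memory first. The following is only a minimal sketch: `each_clean_line` is a hypothetical helper name, not something from the answer, and the temporary file merely stands in for the real input. The literals use escapes so the sketch does not depend on the script's own source encoding.

```ruby
require 'tempfile'

# Hypothetical helper (not from the answer): yields each line of the file at
# `path`, with invalid UTF-8 bytes replaced by `replacement` ('' drops them).
def each_clean_line(path, replacement = '')
  return enum_for(:each_clean_line, path, replacement) unless block_given?

  # Tag every line as UTF-8 regardless of the locale's default encoding.
  File.foreach(path, encoding: 'UTF-8') do |line|
    yield line.scrub(replacement)
  end
end

# Demo input: one line with a stray 0xE4 byte, followed by a fully valid line.
raw = "Partly valid\xE4 UTF-8\n".b + "Valid: \u00E4\u00F6\u00FC\u00DF\n".b

clean_lines = Tempfile.create('scrub_demo') do |f|
  f.binmode
  f.write(raw)
  f.flush
  each_clean_line(f.path).to_a
end

clean_lines.each { |line| puts line }
```

Passing a visible replacement such as `'?'` instead of `''` makes it easier to spot where bytes were dropped.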


Situation 2

You have an input file that could be in either UTF-8 or ISO-8859-1 encoding

Check which encoding it is and convert to UTF-8 (if necessary):

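# Create a sample file whose last four bytes are ISO-8859-1 for "äöüß"
test = "String in ISO-8859-1 encoding: \xE4\xF6\xFC\xDF"
File.open('input_file', 'w') { |f| f.write(test) }

str = File.read('input_file')

# Not valid UTF-8, so treat the bytes as ISO-8859-1 and transcode in place
unless str.valid_encoding?
  str.encode!('UTF-8', 'ISO-8859-1', invalid: :replace)
end
str
# => "String in ISO-8859-1 encoding: äöüß"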
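
The check-then-convert step can also be wrapped in a small function that works on raw bytes, so the result does not depend on how the string happened to be tagged when it was read. This is a sketch under the same assumption as above, namely that the input is either UTF-8 or ISO-8859-1; `to_utf8` and its `fallback` parameter are hypothetical names introduced here for illustration.

```ruby
# Hypothetical helper (not from the answer): returns `bytes` as a valid UTF-8
# string. Assumes the only possible encodings are UTF-8 and `fallback`.
def to_utf8(bytes, fallback: Encoding::ISO_8859_1)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Not valid UTF-8: reinterpret the same bytes as `fallback` and transcode.
  bytes.encode(Encoding::UTF_8, fallback, invalid: :replace, undef: :replace)
end

# The same four characters, once as ISO-8859-1 bytes and once as UTF-8 bytes.
latin1_bytes = "String in ISO-8859-1 encoding: \xE4\xF6\xFC\xDF".b
utf8_bytes   = "String in UTF-8 encoding: \u00E4\u00F6\u00FC\u00DF".b

from_latin1 = to_utf8(latin1_bytes)
from_utf8   = to_utf8(utf8_bytes)

puts from_latin1
puts from_utf8
```

With a real file, the raw bytes could come from `File.binread(in_path)`, which returns a binary-tagged string and performs no transcoding of its own.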


Notes



The above code snippets assume that Ruby encodes all your strings in UTF-8 by default. Although this is almost always the case, you can make sure of it by starting your scripts with # encoding: UTF-8. Note that this magic comment only sets the encoding of string literals in the script; strings read from a file are tagged with Encoding.default_external, which normally follows the locale.

It is programmatically possible to detect invalid input in most multi-byte encodings like UTF-8 (in Ruby, see String#valid_encoding?). However, it is NOT possible (or at least extremely hard) to programmatically detect invalidity in single-byte encodings like ISO-8859-1, because almost any byte sequence is formally valid there. Thus the above code snippet does not work the other way around, i.e. detecting whether a String is valid ISO-8859-1.

Even though UTF-8 has become increasingly popular as the default encoding in computer systems, ISO-8859-1 and other Latin-1 flavors are still very popular in Western countries, especially in North America. Be aware that there are several single-byte encodings that are very similar to ISO-8859-1 but differ slightly. Examples: CP1252 (a.k.a. Windows-1252) and ISO-8859-15.
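
The last two notes can be illustrated with a short sketch. It checks the same bytes against UTF-8 and ISO-8859-1, then decodes two individual bytes under the similar single-byte encodings named above. The variable names and the choice of the bytes 0xA4 and 0x80 are illustrative only.

```ruby
# Bytes that cannot be UTF-8 but are accepted as ISO-8859-1.
bytes = "caf\xE9 \xA4 \x80".b

utf8_ok   = bytes.dup.force_encoding('UTF-8').valid_encoding?
latin1_ok = bytes.dup.force_encoding('ISO-8859-1').valid_encoding?

puts "valid as UTF-8?      #{utf8_ok}"
puts "valid as ISO-8859-1? #{latin1_ok}"

# The same byte decodes to different characters in similar encodings.
byte_a4 = "\xA4".b
byte_80 = "\x80".b

# Unicode code points produced by each interpretation.
decoded = {
  'ISO-8859-1 0xA4'   => byte_a4.encode('UTF-8', 'ISO-8859-1').ord,
  'ISO-8859-15 0xA4'  => byte_a4.encode('UTF-8', 'ISO-8859-15').ord,
  'ISO-8859-1 0x80'   => byte_80.encode('UTF-8', 'ISO-8859-1').ord,
  'Windows-1252 0x80' => byte_80.encode('UTF-8', 'Windows-1252').ord
}

decoded.each { |label, cp| puts format('%-18s -> U+%04X', label, cp) }
```

Because validity checks cannot tell these encodings apart, the fallback encoding in Situation 2 has to come from knowledge about where the file originated.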


[ruby] [encoding] [utf8] [file-encoding] [character-encoding]
    
             
                                                        
            
            
              
                