I'm using the StreamReader class in .NET like this:

    using (StreamReader reader = new StreamReader(@"c:\somefile.html", true))
    {
        string filetext = reader.ReadToEnd();
    }
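For context, that boolean argument is detectEncodingFromByteOrderMarks, so this overload only recognizes encodings that announce themselves with a byte order mark. A minimal sketch of inspecting what it actually picked (the file path is just an example):

    using System;
    using System.IO;

    class BomCheck
    {
        static void Main()
        {
            // true = detectEncodingFromByteOrderMarks: only BOM-marked files
            // (UTF-8, UTF-16 LE/BE, UTF-32) are recognized; anything else
            // silently falls back to the default of UTF-8.
            using (var reader = new StreamReader(@"c:\somefile.html", true))
            {
                string filetext = reader.ReadToEnd();
                // CurrentEncoding is only meaningful after the first read.
                Console.WriteLine(reader.CurrentEncoding.WebName);
            }
        }
    }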
I had good luck with Ude, a C# port of the Mozilla Universal Charset Detector.
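A sketch based on the usage example in Ude's README (exact API details may vary by version):

    using System;
    using System.IO;
    using Ude;

    class Detect
    {
        static void Main(string[] args)
        {
            using (FileStream fs = File.OpenRead(args[0]))
            {
                var detector = new CharsetDetector();
                detector.Feed(fs);   // scan the stream, gathering statistics
                detector.DataEnd();  // finalize detection

                if (detector.Charset != null)
                    Console.WriteLine("Charset: {0}, confidence: {1}",
                                      detector.Charset, detector.Confidence);
                else
                    Console.WriteLine("Detection failed.");
            }
        }
    }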
A hacky technique might be to take an MD5 of the raw bytes, then decode the text with various candidate encodings, re-encode with each, and MD5 the result; if one round trip matches the original hash, you guess it's that encoding (see the sketch below). That's obviously too slow for something that handles a lot of files, but for something like a text editor I could see it working.
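A minimal sketch of that round-trip idea (the helper name and candidate list are my own):

    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;
    using System.Text;

    static class EncodingGuesser
    {
        // Returns the first candidate whose decode/re-encode round trip
        // reproduces the original bytes, or null if none do.
        public static Encoding GuessByRoundTrip(string path)
        {
            byte[] original = File.ReadAllBytes(path);
            using (var md5 = MD5.Create())
            {
                byte[] originalHash = md5.ComputeHash(original);

                // Order matters: Windows-1252 round-trips almost any byte
                // sequence, so try the stricter encodings first. (On .NET
                // Core, register CodePagesEncodingProvider to get 1252.)
                Encoding[] candidates =
                {
                    Encoding.UTF8,
                    Encoding.Unicode,           // UTF-16 LE
                    Encoding.GetEncoding(1252), // Windows-1252
                };

                foreach (Encoding enc in candidates)
                {
                    byte[] roundTrip = enc.GetBytes(enc.GetString(original));
                    if (md5.ComputeHash(roundTrip).SequenceEqual(originalHash))
                        return enc;
                }
            }
            return null; // no candidate survived the round trip
        }
    }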
Other than that, it's a matter of getting your hands dirty porting the Java libraries from this post that came from the Delphi SO question, or using the IE MLang feature.
UTF-8 is designed in such a way that text in an arbitrary 8-bit encoding like Latin-1 is unlikely to decode as valid UTF-8 by accident.
So the minimum approach is this (pseudocode, I don't talk .NET):
    try:
        u = some_text.decode("UTF-8")
    except UnicodeDecodeError:
        u = some_text.decode("most-likely-encoding")
For the most-likely-encoding one usually uses e.g. Latin-1 or cp1252 or whatever fits the text you expect. More sophisticated approaches might try to find language-specific character pairings, but I'm not aware of a library that does that. A .NET rendering of this approach is sketched below.
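For illustration, here is how that might look in C#; a strict UTF8Encoding built with throwOnInvalidBytes: true throws DecoderFallbackException on malformed input, and Windows-1252 stands in for "most-likely-encoding":

    using System.Text;

    static class FallbackDecoder
    {
        // Try strict UTF-8 first; fall back to a likely 8-bit encoding.
        public static string Decode(byte[] bytes)
        {
            var strictUtf8 = new UTF8Encoding(
                encoderShouldEmitUTF8Identifier: false,
                throwOnInvalidBytes: true);
            try
            {
                return strictUtf8.GetString(bytes);
            }
            catch (DecoderFallbackException)
            {
                // Assumption: Windows-1252 as the "most likely" fallback.
                return Encoding.GetEncoding(1252).GetString(bytes);
            }
        }
    }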
I used this to do something similar a while back:
http://www.conceptdevelopment.net/Localization/NCharDet/
See my (recent) answer to this (as far as I can tell, equivalent) question: How can I detect the encoding/codepage of a text file
It does NOT attempt to guess across a range of possible "national" encodings like MLang and NCharDet do, but rather assumes you know what kind of non-unicode files you're likely to encounter. As far as I can tell from your question, it should address your problem pretty reliably (without relying on the "black box" of MLang).
You should read this article by Raymond Chen: Some files come up strange in Notepad. He goes into detail on how programs can guess what an encoding is (and some of the fun that comes from guessing).