How to guess the encoding of a file with no BOM in .NET?

野趣味 2020-12-14 13:07

I'm using the StreamReader class in .NET like this:

using( StreamReader reader = new StreamReader( "c:\\somefile.html", true ) ) {
    string filetext = rea


        
8 answers
  •  遥遥无期
    2020-12-14 13:49

    UTF-8 is designed so that text encoded in an arbitrary 8-bit encoding such as latin1 is unlikely to decode as valid UTF-8. If the bytes do decode cleanly as UTF-8, the text is very probably UTF-8.
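To make that property concrete, here is a small Python sketch (Python because the pseudocode in this answer is Python-flavoured; `is_valid_utf8` is a hypothetical helper name, not a library function). In Latin-1 the letter é is the single byte 0xE9, which UTF-8 treats as the start of a three-byte sequence, so strict decoding rejects it when the required continuation bytes are missing.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# The same word in two encodings.
utf8_bytes = "café".encode("utf-8")      # é becomes the two bytes 0xC3 0xA9
latin1_bytes = "café".encode("latin-1")  # é becomes the single byte 0xE9

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

Pure ASCII input passes the check as well, since ASCII is a subset of UTF-8; in that case the choice of fallback encoding makes no difference.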

    So the minimal approach is this (pseudocode, I don't speak .NET):

    try:
        u = some_text.decode("UTF-8")
    except UnicodeDecodeError:
        u = some_text.decode("most-likely-encoding")

    For the most-likely encoding, one usually picks latin1 or cp1252, or whatever fits the expected input. More sophisticated approaches might look for language-specific character pairings, but I'm not aware of a library that does that.
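The try-UTF-8-then-fall-back idea can be sketched as runnable Python. This is a minimal illustration, not a full detector: the function names `decode_guessing` and `read_text_guessing` and the default fallback of cp1252 are assumptions made for the example, and BOM handling is deliberately left out.

```python
from pathlib import Path
from typing import Tuple


def decode_guessing(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Decode bytes as UTF-8 if valid, else with the fallback encoding.

    Returns (text, name_of_encoding_used).
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: the fallback is a single-byte encoding. Bytes it
        # leaves undefined are replaced with U+FFFD instead of raising.
        return data.decode(fallback, errors="replace"), fallback


def read_text_guessing(path: str, fallback: str = "cp1252") -> Tuple[str, str]:
    """Read a file as raw bytes and decode it with decode_guessing."""
    return decode_guessing(Path(path).read_bytes(), fallback)
```

A note on the fallback choice: in Python, latin1 maps every byte value to a character and so never fails, while cp1252 leaves a handful of byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, which is why the sketch passes `errors="replace"`. Returning the encoding name alongside the text lets the caller log or display which guess was used.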
