Detect encoding in wrongly encoded UTF-8 text file

Submitted by 冷眼眸甩不掉的悲伤 on 2021-01-29 09:04:21

Question


I have an encoding issue.

I have millions of text files that I need to parse for a language data science project. Each text file is encoded as UTF-8, but I just found that some of these source files are not encoded properly.

For example, I have a Chinese text file that is encoded as UTF-8, but the text in the file looks like this:

Subject: »Ø¸´: ÎÒÉý¼¶µ½
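
This looks like classic double-encoded mojibake: GB2312 bytes misread as Latin-1 and then re-saved as UTF-8. A minimal sketch of how such a string can arise; the Chinese original "回复: 我升级到" ("Re: I upgraded to") is my reconstruction based on the answers below, so treat it as an assumption:

# Hedged sketch: reproducing the garbled header from a reconstructed GB2312 original
original = '回复: 我升级到'             # "Re: I upgraded to" (assumed original text)
gb_bytes = original.encode('gb2312')    # the bytes the file should have contained
mojibake = gb_bytes.decode('latin1')    # each GB2312 byte misread as one Latin-1 char
print(mojibake)                         # »Ø¸´: ÎÒÉý¼¶µ½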

When I use Python to detect the encoding of this Chinese text file, chardet tells me the file is encoded as UTF-8:

import chardet

with open(path, 'rb') as f:
    data = f.read()

# chardet inspects the raw bytes and returns its best guess
encoding = chardet.detect(data)['encoding']

UnicodeDammit also tells me the file is encoded as UTF-8:

from bs4 import UnicodeDammit

with open(path, 'rb') as f:
    data = f.read()

# UnicodeDammit (from Beautiful Soup) also works on the raw bytes
encoding = UnicodeDammit(data).original_encoding
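
Both detectors are arguably right at the byte level: a string produced by this kind of double-encoding round-trip is structurally valid UTF-8, so byte-level detection has nothing to complain about. A minimal sketch, assuming the GB2312-to-Latin-1 round-trip described above:

# The mojibake bytes are genuinely valid UTF-8, which is why detectors report 'utf-8'
mojibake_bytes = '回复: 我升级到'.encode('gb2312').decode('latin1').encode('utf-8')
mojibake_bytes.decode('utf-8')   # succeeds - no UnicodeDecodeError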

Meanwhile, I know it's not UTF-8; it should be the GB2312 Chinese encoding instead. If I open this file in Notepad++, it is also detected as UTF-8, and all Chinese characters show as gibberish. Only if I manually switch the encoding in Notepad++ to GB2312 do I get the proper text:

Subject: 禄脴赂麓: 脦脪脡媒录露碌陆

I have a number of files like this, in all kinds of languages.

Is there a way I can detect encoding in these badly encoded UTF-8 files?

Example text file can be downloaded here: https://gofile.io/d/qMcgkt


Answer 1:


Eventually, I figured it out. Using CharsetNormalizerMatches seems to work, properly detecting the encoding. Anyway, this is how I implemented it, and it works like a charm, correctly detecting the gb18030 encoding for the file in question:

from charset_normalizer import CharsetNormalizerMatches as CnM
encoding = CnM.from_path(path).best().first().encoding
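
Note that CharsetNormalizerMatches comes from an early (1.x) release of charset_normalizer; in later releases the class-based API was replaced by module-level functions. A rough equivalent under a current release (version-specific, so treat it as an assumption):

from charset_normalizer import from_path   # assumes charset_normalizer >= 2.0

# best() returns the most plausible match; .encoding is its canonical name
best_guess = from_path(path).best()
print(best_guess.encoding)   # e.g. 'gb18030' for the file in question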

Note: The answer was hinted to me by someone who suggested using CharsetNormalizerMatches but later deleted their post here. Too bad; I'd love to give them the credit.




Answer 2:


You can't get a definite encoding for your linked example.txt file, because it is a concatenation of two differently encoded parts:

path = r'D:\Downloads\example.txt'
with open(path, 'rb') as f:
    data = f.read()

# First 37 bytes: double mojibake - GB2312 text misread as Latin-1
# and re-saved as UTF-8, so the bogus UTF-8 layer is undone first
print(data[:37].decode('utf-8').encode('latin1').decode('gb2312'))

# Remaining bytes: plain GB2312-encoded Chinese
print(data[37:].decode('gb2312'))
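
The same Latin-1 round-trip can also feed a detector instead of a hard-coded codec: once the bogus UTF-8 layer is stripped, chardet has a fair chance of identifying the inner Chinese encoding. A hedged sketch (chardet's guess on such a short sample is not guaranteed):

import chardet

# Undo the wrongly applied UTF-8 layer, then re-detect the inner bytes
inner = data[:37].decode('utf-8').encode('latin1')
print(chardet.detect(inner))   # ideally reports a GB-family encoding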

Pasting the result into Google Translate gives

Subject: Re: I upgraded to

The orange version of the orange version, should be corrected

Unfortunately, SO thinks the Chinese text in the result is spam, so I can't embed it here…

Body cannot contain "".

This appears to be spam. If you think we've made an error, make a post in meta.

Edit: print(data[:37].decode('gb18030')) returns

Subject: 禄脴赂麓: 脦脪脡媒录露碌陆

Google Translate then gives Subject: Lulululu: Lululu Lulu as an English equivalent of the latter string.
Anyway, the above-mentioned Subject: Re: I upgraded to (or Re: My promotion, suggested by Mark Tolonen) looks far more meaningful than this…



Source: https://stackoverflow.com/questions/65363507/detect-encoding-in-wrongly-encoded-utf-8-text-file
