Question
I have an encoding issue.
I have millions of text files that I need to parse for a language data science project. Each text file is supposed to be encoded as UTF-8, but I just found that some of these source files are not encoded properly.
For example, I have a Chinese text file that is encoded as UTF-8, but the text in the file looks like this:
Subject: »Ø¸´: ÎÒÉý¼¶µ½
When I use Python to detect the encoding of this Chinese text file, chardet tells me the file is encoded as UTF-8:
import chardet

with open(path, 'rb') as f:
    data = f.read()
encoding = chardet.detect(data)['encoding']
UnicodeDammit also tells me the file is encoded as UTF-8:
from bs4 import UnicodeDammit

with open(path, 'rb') as f:
    data = f.read()
encoding = UnicodeDammit(data).original_encoding
Meanwhile, I know it's not UTF-8; it should be the Chinese GB2312 encoding instead. If I open this file in Notepad++, it is also detected as UTF-8 and all Chinese characters show as gibberish. Only if I manually switch the encoding in Notepad++ to GB2312 do I get the proper text:
Subject: 禄脴赂麓: 脦脪脡媒录露碌陆
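What the Notepad++ switch does can be reproduced in Python by decoding the raw bytes directly; a minimal sketch, assuming path points at the affected file (gb18030 is a superset of GB2312 and covers the same bytes):

with open(path, 'rb') as f:
    raw = f.read()
# decode the raw bytes the way Notepad++ does after the manual switch
print(raw.decode('gb18030', errors='replace'))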
I have a number of files like this, in all kinds of languages.
Is there a way I can detect the actual encoding of these badly encoded UTF-8 files?
Example text file can be downloaded here: https://gofile.io/d/qMcgkt
Answer 1:
Eventually, I figured it out. Using CharsetNormalizerMatches works, properly detecting the encoding. Anyway, this is how I implemented it, and it works like a charm, correctly detecting gb18030 encoding for the file in question:
from charset_normalizer import CharsetNormalizerMatches as CnM
encoding = CnM.from_path(path).best().first().encoding
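Note that newer releases of charset-normalizer (2.x and later) removed the CharsetNormalizerMatches class; as far as I can tell, the equivalent under the current API looks something like this:

from charset_normalizer import from_path

# best() returns the most plausible CharsetMatch, or None if nothing fits
best_guess = from_path(path).best()
encoding = best_guess.encoding if best_guess else None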
Note: The answer was hinted to me by someone who suggested using CharsetNormalizerMatches but later deleted their post here. Too bad, I'd have loved to give them the credit.
Answer 2:
You can't get a single definite encoding for your linked example.txt file, as it concatenates two different encodings:
path = r'D:\Downloads\example.txt'
with open(path, 'rb') as f:
    data = f.read()

# double mojibake
print(data[:37].decode('utf-8').encode('latin1').decode('gb2312'))
# Chinese
print(data[37:].decode('gb2312'))
Pasting the result into Google Translate gives
Subject: Re: I upgraded to The orange version of the orange version, should be corrected
Unfortunately, SO thinks that the Chinese text in the result is spam, so I can't embed it here; the post is rejected with:
Body cannot contain "".
This appears to be spam. If you think we've made an error, make a post in meta.
Edit: print(data[:37].decode('gb18030')) returns
Subject: 禄脴赂麓: 脦脪脡媒录露碌陆
and Google Translate then gives Subject: Lulululu: Lululu Lulu as an English equivalent for the latter string.
Anyway, the abovementioned Subject: Re: I upgraded to (or Re: My promotion, suggested by Mark Tolonen) looks more meaningful than this…
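For completeness, the whole double-mojibake chain can be reproduced forward from that recovered subject line; a minimal sketch showing why the file ends up as perfectly valid UTF-8 and fools the detectors:

original = '回复: 我升级到'            # the recovered subject ("Re: I upgraded to")
gb_bytes = original.encode('gb2312')    # legitimate GB2312 bytes
mojibake = gb_bytes.decode('latin1')    # misread as Latin-1: '»Ø¸´: ÎÒÉý¼¶µ½'
mangled = mojibake.encode('utf-8')      # re-saved as UTF-8, so detectors see valid UTF-8
# undoing it reverses each step
assert mangled.decode('utf-8').encode('latin1').decode('gb2312') == original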
Source: https://stackoverflow.com/questions/65363507/detect-encoding-in-wrongly-encoded-utf-8-text-file