Detect encoding in wrongly encoded UTF-8 text file

Submitted by 冷眼眸甩不掉的悲伤 on 2021-01-29 09:04:21

Question


I have an encoding issue.

I have millions of text files that I need to parse for a language data science project. Each text file is encoded as UTF-8, but I just found that some of these source files are not encoded properly.

For example, I have a Chinese text file that is encoded as UTF-8, but the text in the file looks like this:

Subject: »Ø¸´: ÎÒÉý¼¶µ½
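
This looks like classic double-encoded mojibake: GB2312 bytes misread as Latin-1 and then re-saved as UTF-8. A minimal sketch of how such a string can arise; the Chinese original "回复: 我升级到" ("Re: I upgraded to") is my reconstruction based on the answers below, so treat it as an assumption:

# Hedged sketch: reproducing the garbled header from a reconstructed GB2312 original
original = '回复: 我升级到'             # "Re: I upgraded to" (assumed original text)
gb_bytes = original.encode('gb2312')    # the bytes the file should have contained
mojibake = gb_bytes.decode('latin1')    # each GB2312 byte misread as one Latin-1 char
print(mojibake)                         # »Ø¸´: ÎÒÉý¼¶µ½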

When I use Python to detect the encoding of this Chinese text file, chardet tells me the file is encoded as UTF-8:

import chardet

with open(path, 'rb') as f:
    data = f.read()

# chardet inspects the raw bytes and returns its best guess
encoding = chardet.detect(data)['encoding']

UnicodeDammit also tells me the file is encoded as UTF-8:

from bs4 import UnicodeDammit

with open(path, 'rb') as f:
    data = f.read()

# UnicodeDammit (from Beautiful Soup) also works on the raw bytes
encoding = UnicodeDammit(data).original_encoding
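
Both detectors are arguably right at the byte level: a string produced by this kind of double-encoding round-trip is structurally valid UTF-8, so byte-level detection has nothing to complain about. A minimal sketch, assuming the GB2312-to-Latin-1 round-trip described above:

# The mojibake bytes are genuinely valid UTF-8, which is why detectors report 'utf-8'
mojibake_bytes = '回复: 我升级到'.encode('gb2312').decode('latin1').encode('utf-8')
mojibake_bytes.decode('utf-8')   # succeeds - no UnicodeDecodeError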

Meanwhile, I know it's not UTF-8; it should be the GB2312 Chinese encoding instead. If I open this file in Notepad++, it is also detected as UTF-8, and all Chinese characters show as gibberish. Only if I manually switch the encoding in Notepad++ to GB2312 do I get the proper text:

Subject: 禄脴赂麓: 脦脪脡媒录露碌陆

I have a number of files like this, in all kinds of languages.

Is there a way I can detect encoding in these badly encoded UTF-8 files?

Example text file can be downloaded here: https://gofile.io/d/qMcgkt


Answer 1:


Eventually, I figured it out. Using CharsetNormalizerMatches seems to work, properly detecting the encoding. Anyway, this is how I implemented it, and it works like a charm, correctly detecting the gb18030 encoding for the file in question:

from charset_normalizer import CharsetNormalizerMatches as CnM
encoding = CnM.from_path(path).best().first().encoding
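
Note that CharsetNormalizerMatches comes from an early (1.x) release of charset_normalizer; in later releases the class-based API was replaced by module-level functions. A rough equivalent under a current release (version-specific, so treat it as an assumption):

from charset_normalizer import from_path   # assumes charset_normalizer >= 2.0

# best() returns the most plausible match; .encoding is its canonical name
best_guess = from_path(path).best()
print(best_guess.encoding)   # e.g. 'gb18030' for the file in question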

Note: The answer was hinted to me by someone who suggested using CharsetNormalizerMatches but later deleted their post here. Too bad; I'd love to give them the credit.




Answer 2:


You can't get a definite encoding for your linked example.txt file, because it is a concatenation of two differently encoded parts:

path = r'D:\Downloads\example.txt'
with open(path, 'rb') as f:
    data = f.read()

# First 37 bytes: double mojibake - GB2312 text misread as Latin-1
# and re-saved as UTF-8, so the bogus UTF-8 layer is undone first
print(data[:37].decode('utf-8').encode('latin1').decode('gb2312'))

# Remaining bytes: plain GB2312-encoded Chinese
print(data[37:].decode('gb2312'))
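
The same Latin-1 round-trip can also feed a detector instead of a hard-coded codec: once the bogus UTF-8 layer is stripped, chardet has a fair chance of identifying the inner Chinese encoding. A hedged sketch (chardet's guess on such a short sample is not guaranteed):

import chardet

# Undo the wrongly applied UTF-8 layer, then re-detect the inner bytes
inner = data[:37].decode('utf-8').encode('latin1')
print(chardet.detect(inner))   # ideally reports a GB-family encoding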

Pasting the result into Google Translate gives

Subject: Re: I upgraded to

The orange version of the orange version, should be corrected

Unfortunately, SO thinks the Chinese text in the result is spam, so I can't embed it here…

Body cannot contain "".

This appears to be spam. If you think we've made an error, make a post in meta.

Edit: print(data[:37].decode('gb18030')) returns

Subject: 禄脴赂麓: 脦脪脡媒录露碌陆

Google Translate then gives Subject: Lulululu: Lululu Lulu as an English equivalent of the latter string.
Anyway, the above-mentioned Subject: Re: I upgraded to (or Re: My promotion, suggested by Mark Tolonen) looks far more meaningful than this…



Source: https://stackoverflow.com/questions/65363507/detect-encoding-in-wrongly-encoded-utf-8-text-file
