Detect encoding in wrongly encoded UTF-8 text file
Question

I have an encoding issue. I have millions of text files that I need to parse for a language data science project. Each text file is encoded as UTF-8, but I just found that some of these source files are not encoded properly.

For example, I have a Chinese text file that is encoded as UTF-8, but the text in the file looks like this:

Subject: »Ø¸´: ÎÒÉý¼¶µ½

When I use Python to detect the encoding of this Chinese text file, chardet tells me the file is encoded as UTF-8:

with open(path,'rb') as f:
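One possible reading of the sample subject line, offered as an assumption rather than a confirmed diagnosis: the original bytes were GBK (or GB2312), were decoded at some point as Latin-1, and the resulting characters were then saved as UTF-8. That would explain why a detector reports UTF-8, since the file really is valid UTF-8 and only the characters in it are wrong. The sketch below uses only the standard library; the helper name `undo_mojibake` and the codec pair `latin-1`/`gbk` are illustrative guesses, not something the question states.

```python
# Sample copied from the question: the text as it appears after decoding the file as UTF-8.
garbled = "»Ø¸´: ÎÒÉý¼¶µ½"

def undo_mojibake(text, wrong="latin-1", real="gbk"):
    """Reverse one layer of mis-decoding.

    Assumption: the text was originally `real`-encoded bytes that were
    decoded with `wrong`. Raises UnicodeError if the guess does not fit.
    """
    return text.encode(wrong).decode(real)

# The garbled text round-trips through UTF-8 without error, which is
# consistent with a detector reporting the file as UTF-8.
assert garbled.encode("utf-8").decode("utf-8") == garbled

repaired = undo_mojibake(garbled)
print(repaired)
```

If the guess is wrong for a given file, either the `encode` or the `decode` step raises a `UnicodeError`, so the same helper can be wrapped in a `try`/`except` to test candidate codec pairs across a large corpus.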