Question
I've been searching the web for a way to read files with different encodings, and I've found many instances of "it's impossible to tell what encoding a file is" (so if anyone reading this has a link, I would appreciate it). However, the problem I was dealing with was narrower than "open a file in any encoding": I only needed to open files from a set of known encodings. I am by no means an expert on this topic, but I thought I would post my solution in case anyone else runs into this issue.
Specific example:
Known file encodings: utf8 and Windows ANSI
Initial issue: as I now know, calling Python's open('file', 'r') without an encoding does not detect anything; it falls back to a platform-dependent default, which was utf8 in my case. That raised a UnicodeDecodeError at runtime when trying to f.readline() an ansi file. A common search for this is: "UnicodeDecodeError: 'utf-8' codec can't decode byte"
Secondary issue: so then I thought, okay, simple enough: we know which exception is being raised, so read a line, and if it raises UnicodeDecodeError, close the file and reopen it with open('file', 'r', encoding='ansi'). The problem with this was that sometimes utf8 was able to read the first few lines of an ansi-encoded file just fine but then failed on a later line. Now the solution became clear: I had to read through the entire file as utf8, and if that failed, I knew the file was ansi.
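For the initial issue, the default that open() uses when no encoding is given can be checked directly. This is a minimal sketch, assuming a standard CPython install; the variable name is only for illustration, and the value printed depends on the platform and its settings.

```python
import locale

# The encoding that open() falls back to when none is passed.
# Passing False asks for the current setting without changing the locale.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

Passing encoding= explicitly on every open() call avoids depending on this value.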
I'll post my take on this as an answer but if someone has a better solution, I would also appreciate that :)
Answer 1:
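# First pass: try to decode the whole file as utf8 to find out which encoding works.
# Note: 'ansi' is a Windows-only codec name; elsewhere use the specific code page, e.g. 'cp1252'.
f = open(path, 'r', encoding='utf8')
while True:
    try:
        line = f.readline()
    except UnicodeDecodeError:
        # utf8 failed somewhere in the file, so treat the file as ansi
        f.close()
        encoding = 'ansi'
        break
    if not line:
        # reached the end of the file without a decode error
        f.close()
        encoding = 'utf8'
        break

# now open your file for actual reading and data handling
with open(path, 'r', encoding=encoding) as f:
    for line in f:
        ...  # handle each line here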
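The same idea can be written more compactly by reading the raw bytes once and trying each known encoding in order. This is only a sketch under a few assumptions: the function name read_text_with_fallback is hypothetical, the file is small enough to hold in memory, and cp1252 stands in for "Windows ANSI" (the actual ANSI code page depends on the machine that wrote the file).

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes the whole file.

    Hypothetical helper: the candidate list is an assumption and should hold
    the encodings you actually expect, strictest first.
    """
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            # Decoding all bytes at once means a late bad byte is still caught.
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")
```

The order matters: utf8 rejects most non-utf8 input, while a single-byte code page accepts almost anything, so it should come last.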
Answer 2:
If you replace the codec in the error handler from the question linked by tripleee, it becomes:
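import codecs

def mixed_decoder(unicode_error):
    # Called by the utf8 decoder whenever it hits bytes it cannot decode.
    if not isinstance(unicode_error, UnicodeDecodeError):
        raise unicode_error
    # The handler receives the exact error position, so no state is kept between calls.
    position = unicode_error.start
    # Decode only the offending byte as ansi, then resume utf8 decoding after it.
    # Slicing keeps a bytes object; indexing bytes in Python 3 would give an int.
    new_char = unicode_error.object[position:position + 1].decode("ansi")
    #new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)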
Bonus: once errors="mixed" is passed when opening or decoding, the file is read as UTF-8 until an error occurs, and no in-place error handling is needed.
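A self-contained sketch of how such a handler can be registered and used is shown below. The handler name "cp1252_fallback" and the function name are hypothetical, and cp1252 is assumed as the fallback code page so the example also runs outside Windows.

```python
import codecs


def cp1252_fallback(error):
    """Decode the bytes utf8 rejected as cp1252 and resume right after them."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad_bytes = error.object[error.start:error.end]
    # errors="replace" covers the few byte values cp1252 leaves undefined.
    return bad_bytes.decode("cp1252", errors="replace"), error.end


# Register under a hypothetical name so it can be passed as errors=...
codecs.register_error("cp1252_fallback", cp1252_fallback)

# Mixed input: the first word is utf8, the second is cp1252.
raw = "naïve ".encode("utf-8") + "café".encode("cp1252")
text = raw.decode("utf-8", errors="cp1252_fallback")
print(text)  # naïve café
```

The same name can be passed to open(path, encoding="utf-8", errors="cp1252_fallback"), since open() accepts any handler registered with codecs.register_error().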
Source: https://stackoverflow.com/questions/55551208/how-to-read-multiple-known-file-encodings