How to read multiple known file encodings

Posted by 冷眼眸甩不掉的悲伤 on 2021-01-29 08:32:01

Question


I've been searching the web for a way to read files with different encodings, and I keep finding variations of "it's impossible to tell what encoding a file is" (if anyone reading this has a link to a solution, I would appreciate it). However, my problem was narrower than "open a file in any encoding": I only needed to open files in a known set of encodings. I am by no means an expert on this topic, but I thought I would post my solution in case anyone else runs into this issue.

Specific example:

Known file encodings: UTF-8 and Windows ANSI

Initial issue: as I now know, when no encoding is passed to Python's open('file', 'r'), it falls back to a platform-dependent default, which in my case was encoding='utf8'. That raised a UnicodeDecodeError at runtime when calling f.readline() on an ANSI file. A common search for this is: "UnicodeDecodeError: 'utf-8' codec can't decode byte"
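
Since the default depends on the platform and interpreter settings, it can help to check what open() actually picks on a given machine. The following is a minimal sketch; the helper name default_open_encoding is made up for illustration and is not part of the original post.

```python
import os
import tempfile


def default_open_encoding() -> str:
    """Return the encoding open() chooses when none is passed (hypothetical helper)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # No encoding argument: Python falls back to its platform-dependent default.
        with open(path, 'r') as f:
            return f.encoding
    finally:
        os.remove(path)


print(default_open_encoding())
```

Passing encoding= explicitly on every open() call avoids depending on this value at all.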

Secondary issue: my first thought was that this should be simple enough. We know which exception is raised, so read a line, and if it raises UnicodeDecodeError, close the file and reopen it with open('file', 'r', encoding='ansi'). The problem was that utf8 could sometimes read the first few lines of an ANSI-encoded file just fine and then fail on a later line. At that point the solution became clear: I had to read through the entire file as utf8, and only if that failed did I know the file was ANSI.
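
To illustrate why checking only the first line is not enough, here is a small sketch. It assumes that "Windows ANSI" means the cp1252 code page (an assumption; the actual ANSI code page depends on the Windows locale), and the sample text is invented for the demonstration.

```python
# Three lines encoded as Windows-1252; only the last one contains a non-ASCII byte.
data = "plain ascii\nstill ascii\ncaf\u00e9\n".encode("cp1252")
lines = data.splitlines(keepends=True)


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if the bytes are valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Pure ASCII is valid UTF-8, so the first line gives no hint of the real encoding.
first_line_ok = decodes_as_utf8(lines[0])
# The lone 0xE9 byte on the last line is not valid UTF-8, so the whole file fails.
whole_file_ok = decodes_as_utf8(data)

print(first_line_ok, whole_file_ok)
```

The file looks like UTF-8 until the first non-ASCII byte, which may sit anywhere in it.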

I'll post my take on this as an answer but if someone has a better solution, I would also appreciate that :)


Answer 1:


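# path is the file you want to read
# First pass: read the whole file as UTF-8 just to see whether it decodes.
f = open(path, 'r', encoding='utf8')
while True:
    try:
        line = f.readline()
    except UnicodeDecodeError:
        # Not valid UTF-8, so it must be the other known encoding.
        f.close()
        encoding = 'ansi'  # Windows-only codec name
        break
    if not line:
        # Reached the end of the file without a decode error.
        f.close()
        encoding = 'utf8'
        break

# now open your file for actual reading and data handling
with open(path, 'r', encoding=encoding) as f:
    for line in f:
        ...  # handle each line here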
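
The same idea can be sketched as a reusable function that reads the raw bytes once and tries each known encoding in order. This is only an illustration under some assumptions: the name detect_encoding and its candidates parameter are hypothetical, and cp1252 stands in for "ansi" because, as far as I know, the 'ansi' codec name is only available on Windows.

```python
from pathlib import Path


def detect_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper: order matters, so list the strictest encoding first.
    """
    raw = Path(path).read_bytes()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
        return name
    raise ValueError(f"{path!r} did not decode with any of {candidates}")
```

It could then be used as open(path, 'r', encoding=detect_encoding(path)). UTF-8 goes first because it rejects most non-UTF-8 input, whereas a single-byte code page accepts almost any byte sequence.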



Answer 2:


If you replace the codec in the answer from the question linked by tripleee, you get:

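import codecs

# Note: this is module-level state that persists across decode calls.
last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    # In Python 3 the bytes being decoded are in the .object attribute.
    string = unicode_error.object
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    # Slice rather than index: indexing bytes in Python 3 yields an int.
    new_char = string[position:position + 1].decode("ansi")
    #new_char = u"_"
    return new_char, position + 1

# Registers the handler under the name "mixed" for use as errors="mixed".
codecs.register_error("mixed", mixed_decoder)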

Bonus: with errors="mixed" passed to open() or bytes.decode(), the data is read as UTF-8 until an error occurs, and no in-place error handling is needed.
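
As a usage illustration, here is a self-contained sketch of the same technique with a stateless handler. The handler name cp1252_fallback is made up, and cp1252 is again assumed as the "ANSI" code page so that the example is not tied to Windows.

```python
import codecs


def cp1252_fallback(error):
    """Decode bytes that are not valid UTF-8 as Windows-1252 instead of failing."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    # errors="replace" covers the few byte values that cp1252 leaves undefined.
    return bad.decode("cp1252", errors="replace"), error.end


codecs.register_error("cp1252_fallback", cp1252_fallback)

# Sample input mixing both encodings: UTF-8 first, then a cp1252-encoded word.
mixed = "na\u00efve ".encode("utf-8") + "caf\u00e9\n".encode("cp1252")
text = mixed.decode("utf-8", errors="cp1252_fallback")
print(text)
```

The same handler name can be passed to open(path, encoding="utf-8", errors="cp1252_fallback"). One caveat: a cp1252 byte sequence that happens to be valid UTF-8 is decoded as UTF-8, so this approach is a heuristic rather than a guarantee.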



Source: https://stackoverflow.com/questions/55551208/how-to-read-multiple-known-file-encodings
