Question
I have a plain ASCII file. When I try to open it with codecs.open(..., "utf-8"), I am unable to read single characters. ASCII is a subset of UTF-8, so why can't codecs open such a file in UTF-8 mode?
# test.py
import codecs
f = codecs.open("test.py", "r", "utf-8")
# ASCII is supposed to be a subset of UTF-8:
# http://www.fileformat.info/info/unicode/utf8.htm
assert len(f.read(1)) == 1 # OK
f.readline()
c = f.read(1)
print len(c)
print "'%s'" % c
assert len(c) == 1 # fails
# max% p test.py
# 63
# '
# import codecs
#
# f = codecs.open("test.py", "r", "utf-8")
#
# # ASC'
# Traceback (most recent call last):
# File "test.py", line 15, in <module>
# assert len(c) == 1 # fails
# AssertionError
# max%
System:
Linux max 4.4.0-89-generic #112~14.04.1-Ubuntu SMP Tue Aug 1 22:08:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Of course it works with regular open. It also works if I remove the "utf-8" option. Also, what does 63 mean? That's like the middle of the 3rd line. I don't get it.
Answer 1:
Found your problem:
When passed an encoding, codecs.open returns a StreamReaderWriter, which is really just a wrapper around (not a subclass of; it's a "composed of" relationship, not inheritance) StreamReader and StreamWriter. The problem is:
- StreamReaderWriter provides a "normal" read method (that is, it takes a size parameter and that's it).
- It delegates to the internal StreamReader.read method, where the size argument is only a hint as to the number of bytes to read, but not a limit; the second argument, chars, is a strict limiter, but StreamReaderWriter never passes that argument along (it doesn't accept it).
- When size is hinted but not capped using chars, and StreamReader has buffered data large enough to match the size hint, StreamReader.read blindly returns the contents of the buffer rather than limiting it in any way based on the size hint (after all, only chars imposes a maximum return size). A short demonstration of all three points follows this list.
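A minimal sketch of that behavior, using the (non-contractual) reader attribute on the object codecs.open returns, run against the question's test.py under Python 2:

import codecs

f = codecs.open("test.py", "r", "utf-8")
f.readline()             # readline() leaves the rest of the decoded chunk buffered
c = f.reader.read(1)     # size=1 is only a hint: the whole buffer comes back
print len(c)             # prints however much was buffered (63 in the question's run)
f.seek(0)                # seek() resets the reader's internal buffer
f.readline()
c = f.reader.read(1, 1)  # chars=1 strictly caps the result
print len(c)             # prints 1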
The API of StreamReader.read and the meaning of size/chars for that API are the only documented things here; the fact that codecs.open returns a StreamReaderWriter is not contractual, nor is the fact that StreamReaderWriter wraps StreamReader. I just used ipython's ?? magic to read the source code of the codecs module to verify this behavior. But documented or not, that's what it's doing (feel free to read the source code for StreamReaderWriter; it's all Python level, so it's easy).
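For instance, a quick look at the (non-contractual) types involved, in the same Python 2 session style as the question:

import codecs

f = codecs.open("test.py", "r", "utf-8")
print f.__class__         # codecs.StreamReaderWriter
print f.reader.__class__  # the wrapped StreamReader, e.g. encodings.utf_8.StreamReader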
The best solution is to switch to io.open, which is faster and more correct in every standard case (codecs.open supports the weirdo codecs that don't convert between bytes [Py2 str] and str [Py2 unicode], but rather handle str to str or bytes to bytes encodings; that's an incredibly limited use case, though, since most of the time you're converting between bytes and str). All you need to do is import io instead of codecs, and change the codecs.open line to:
f = io.open("test.py", encoding="utf-8")
The rest of your code can remain unchanged (and will likely run faster to boot).
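For reference, here is a sketch of the question's test with that single change applied; in text mode, io.open's read(1) returns at most one character, so both assertions hold:

import io

f = io.open("test.py", encoding="utf-8")
assert len(f.read(1)) == 1  # OK, as before
f.readline()
c = f.read(1)
assert len(c) == 1  # now passes: read(1) caps the result at one character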
As an alternative, you could explicitly bypass StreamReaderWriter to get the StreamReader's read method and pass the limiting argument directly, e.g. change:
c = f.read(1)
to:
# Pass the second, character-limiting argument after the size hint
c = f.reader.read(6, 1)  # 6 is sort of arbitrary; should ensure a full char read in one go
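(A note on the 6: the original UTF-8 design allowed sequences up to six bytes long, which is presumably where that number comes from; since RFC 3629 a UTF-8 code point is at most four bytes, so any size hint of at least 4 should ensure a full character here.)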
I suspect Python Bug #8260, which covers intermingling readline and read on codecs.open created file objects, applies here. Officially, it's "fixed", but if you read the comments, the fix wasn't complete (and may not be possible to complete given the documented API); arbitrarily weird combinations of read and readline will be able to break it.
Again, just use io.open; as long as you're on Python 2.6 or higher, it's available, and it's just plain better.
Source: https://stackoverflow.com/questions/46437761/codecs-openutf-8-fails-to-read-plain-ascii-file