with open('result.txt', 'r') as f:
    data = f.read()
print 'What type is my data:'
print type(data)
for i in data:
    print "what is i:"
    print i
    print i.encode('utf-8')
When you call encode on a str with most (all?) codecs (a call that really makes no sense, since str is a byte-oriented type, not a true text type like unicode that would require encoding), Python implicitly decodes it as ASCII first, then encodes it with your specified encoding. If you want the str to be interpreted as something other than ASCII, you need to decode from the bytes-like str to true text unicode yourself.
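A minimal sketch of decoding explicitly, assuming the bytes happen to be latin-1. The sample bytes and the choice of codec are illustrative assumptions, not taken from the question:

```python
# Hypothetical sample: the latin-1 bytes for "cafe" with an e-acute.
raw = b'caf\xe9'

# Decode explicitly with the codec the bytes were really written in.
# The result is true text (unicode on Python 2, str on Python 3).
text = raw.decode('latin-1')

# Only now encode the text to the target encoding.
utf8_bytes = text.encode('utf-8')
```

The same bytes passed through an implicit ASCII decode would fail, because 0xE9 is outside the ASCII range.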
When you do i.encode('utf-8') and i is a str, you're implicitly saying i is logically text (represented by bytes in Python's default encoding, which is ASCII on Python 2), not binary data. So in order to encode it, Python first needs to decode it to determine what the "logical" text is. Your input is probably encoded in some ASCII superset (e.g. latin-1, or even utf-8) and contains non-ASCII bytes; Python tries to decode them using the ascii codec (to figure out the true Unicode ordinals it needs to encode as utf-8), and fails.
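The failure can be reproduced by spelling out the two steps by hand. This is only a sketch: the sample bytes are hypothetical, and the explicit decode('ascii') stands in for what Python 2 does implicitly, which also lets the snippet run on Python 3:

```python
# Hypothetical sample: UTF-8 bytes containing one non-ASCII character.
raw = b'na\xc3\xafve'

try:
    # Step 1 (implicit on Python 2): decode as ASCII.
    # Step 2: encode the resulting text as UTF-8.
    raw.decode('ascii').encode('utf-8')
    failed_at = None
except UnicodeDecodeError as exc:
    # The error comes from the ASCII decode, not from the UTF-8 encode.
    failed_at = exc.start  # index of the first byte ASCII cannot decode

print(failed_at)
```

Note that the exception is a UnicodeDecodeError even though the call in the question was encode, which is usually the confusing part.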
You need to do one of:

1. Decode the str you read using the correct codec (to get a unicode object), then encode that back to utf-8.
2. Instead of open, import io and use io.open (Python 2.7+ only; on Python 3+, io.open and open are the same function), which gets you an open that works like Python 3's open. You can pass this open an encoding argument (e.g. io.open('/path/to/file', 'r', encoding='latin-1')), and reading from the resulting file object will get you already decoded unicode objects (that can then be encoded to whatever you like).

Note: #1 will not work if the real encoding is something like utf-8 and you defer the work until you're iterating character by character. For non-ASCII characters, utf-8 is multibyte, so if you only have one byte, you can't decode it (because the following bytes are needed to calculate a single ordinal). This is a reason to prefer using io.open to read as unicode natively, so you're not worrying about stuff like this.
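To make the io.open option and the multibyte pitfall concrete, here is a rough sketch. It assumes the file is UTF-8; the temporary path and the sample contents are made up for illustration:

```python
import io
import os
import tempfile

# Hypothetical sample file: "h" followed by an e-acute (2 bytes in UTF-8).
path = os.path.join(tempfile.mkdtemp(), 'result.txt')
with io.open(path, 'wb') as f:
    f.write(b'h\xc3\xa9')

# io.open decodes while reading, so iteration yields whole characters,
# each of which can be safely encoded again.
with io.open(path, 'r', encoding='utf-8') as f:
    chars = [ch.encode('utf-8') for ch in f.read()]

# The pitfall from the note: a single byte of a multibyte character
# cannot be decoded on its own.
with io.open(path, 'rb') as f:
    raw = f.read()
try:
    raw[1:2].decode('utf-8')  # only the first byte of the 2-byte sequence
    single_byte_ok = True
except UnicodeDecodeError:
    single_byte_ok = False
```

With the decoded read, the e-acute comes back as one character; slicing the raw bytes splits it in half and the decode raises.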