Python 3 - if a string contains only ASCII, is it equal to the string as bytes?

Submitted by 三世轮回 on 2019-12-13 14:08:58

Question


Consider Python 3 SMTPD - the data received is contained in a string. http://docs.python.org/3.4/library/smtpd.html quote: "and data is a string containing the contents of the e-mail"

Facts (correct?):

  • Strings in Python 3 are Unicode.
  • Emails are always ASCII.
  • Pure ASCII is valid Unicode.

Therefore the email that came in is pure ASCII (which is valid Unicode), therefore the SMTPD DATA string is exactly equivalent to the original bytes received by SMTPD. Is this correct?

Thus my question: if I encode the SMTPD DATA string as ASCII, or otherwise convert the DATA string to bytes, is the result equivalent to the bytes of the actual email message that arrived via SMTP?

Context, (and perhaps a better question) is "How do I save to a file Python 3's SMTPD DATA as PRECISELY the bytes that were received?" My concern is that when DATA goes through string to bytes conversion then somehow it has been changed from the original bytes that arrived via SMTP.

EDIT: it seems the Python developers think SMTPD should be returning binary data anyway. Doesn't seem to have been fixed... http://bugs.python.org/issue19662


Answer 1:


if a string contains only ASCII, is it equal to the string as bytes?

No. It is not equal in Python 3:

>>> '1' == b'1'
False

A bytes object is not equal to a str (Unicode string) object, in much the same way that an integer is not equal to a string:

>>> '1' == 1
False

In some programming languages, comparisons like the ones above are true, e.g. in Python 2:

>>> b'1' == u'1'
True

and 1 == '1' in Perl:

$ perl -e "print qq(True\n) if 1 == q(1)"
True

Your question is a good example of why the stricter Python 3 behaviour is preferable. It forces programmers to confront their text/bytes misconceptions without waiting for their code to break for some input.
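As a small illustration of the point above, the comparison only becomes meaningful once both sides are the same type, which requires an explicit conversion. The variable names here are just for illustration:

```python
text = "1"   # str: a sequence of Unicode code points
raw = b"1"   # bytes: a sequence of integers in range(256)

# Comparing str with bytes is never true in Python 3.
mixed_comparison = (text == raw)

# Convert explicitly, then compare like with like.
encoded_comparison = (text.encode("ascii") == raw)   # bytes vs bytes
decoded_comparison = (raw.decode("ascii") == text)   # str vs str

print(mixed_comparison, encoded_comparison, decoded_comparison)
```

The choice of 'ascii' here is an assumption about the data; with another encoding the bytes on the right-hand side could differ.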


  • Strings in Python 3 are Unicode.

Yes. Strings are immutable sequences of Unicode code points in Python 3.

  • Emails are always ASCII.

Most emails are transported as 7-bit messages (ASCII range: hex 00-7F), though "virtually all modern email servers are 8-bit clean", i.e. 8-bit content won't be corrupted. In addition, the 8BITMIME extension explicitly permits some 8-bit content to be passed.

In other words: emails are not always ASCII.

  • Pure ASCII is valid Unicode.

ASCII is a character encoding. You can decode some byte sequences to Unicode using the US-ASCII character encoding. Unicode strings themselves have no associated character encoding, i.e. you can encode them into bytes using any character encoding that can represent the corresponding Unicode code points.
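A minimal sketch of that last point: the same str can be encoded with several codecs, and only some of them happen to produce identical bytes. The sample text and the list of encodings are arbitrary choices for illustration:

```python
text = "hello"  # an ASCII-only Unicode string; it carries no encoding itself

# Encode the same code points with several character encodings.
encodings = ["ascii", "utf-8", "latin-1", "utf-16-le"]
encoded = {name: text.encode(name) for name in encodings}

for name in encodings:
    print(name, encoded[name])

# Every encoding decodes back to the same Unicode string.
round_trips = all(encoded[name].decode(name) == text for name in encodings)
print(round_trips)
```

For ASCII-only text, 'ascii', 'utf-8' and 'latin-1' give the same bytes, while 'utf-16-le' does not, so "the string as bytes" is only well defined once an encoding has been chosen.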

Therefore the email that came in is pure ASCII (which is valid Unicode), therefore the SMTPD DATA string is exactly equivalent to the original bytes received by SMTPD. Is this correct?

If the input bytes are all in the ASCII range, then data.decode('ascii', 'strict').encode('ascii') == data holds for that bytes object. However, Lib/smtpd.py applies some conversions to the input (following RFC 5321), so the content that you get as data may differ from what was received even if the input is pure ASCII.
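The round-trip identity can be checked directly on a bytes object. The helper name and the sample message below are made up for illustration, and the sketch says nothing about what smtpd.py does to the data before handing it over:

```python
def roundtrips_via_ascii(data: bytes) -> bool:
    """Return True if decoding as strict ASCII and re-encoding gives back data."""
    try:
        return data.decode("ascii", "strict").encode("ascii") == data
    except UnicodeDecodeError:
        # At least one byte is outside the ASCII range (> 0x7F).
        return False


# Hypothetical wire bytes of a small, ASCII-only message (CRLF line endings).
wire_bytes = b"Subject: test\r\n\r\nHello, world.\r\n"

ascii_ok = roundtrips_via_ascii(wire_bytes)
non_ascii_ok = roundtrips_via_ascii(b"caf\xe9\r\n")
print(ascii_ok, non_ascii_ok)
```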


"How do I save to a file Python 3's SMTPD DATA as PRECISELY the bytes that were received?"

my goal is not to find malformed emails but to save inbound emails to disk in precisely the binary/bytes form that they arrived.

The bug that you've linked (smtpd.py should not decode utf-8) makes smtpd.py non-8-bit-clean.

You could override the SMTPChannel.collect_incoming_data method from smtpd.py to save the incoming bytes as they are.
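A rough sketch of that idea follows. It is not tested against the real smtpd module (which was deprecated and removed from the standard library in Python 3.12), so a stand-in class named FakeChannel plays the role of smtpd.SMTPChannel here. RawBytesCaptureMixin, FakeChannel and raw_chunks are hypothetical names, and the assumption is that the base class's collect_incoming_data receives each chunk as bytes and decodes it:

```python
class RawBytesCaptureMixin:
    """Hypothetical mixin: keep every incoming chunk as bytes before it is decoded."""

    def collect_incoming_data(self, data):
        # Store an untouched copy of the chunk.
        if not hasattr(self, "raw_chunks"):
            self.raw_chunks = []
        self.raw_chunks.append(bytes(data))
        # Then let the base class process the same chunk as usual.
        super().collect_incoming_data(data)


class FakeChannel:
    """Stand-in for smtpd.SMTPChannel so that this sketch runs on its own."""

    def __init__(self):
        self.received_lines = []

    def collect_incoming_data(self, data):
        # Assumed behaviour of the base class: decode the chunk to str.
        self.received_lines.append(str(data, "utf-8", "replace"))


class CapturingChannel(RawBytesCaptureMixin, FakeChannel):
    pass


channel = CapturingChannel()
channel.collect_incoming_data(b"caf\xe9")   # a non-UTF-8, 8-bit chunk

raw = b"".join(channel.raw_chunks)
print(raw)                      # the bytes exactly as passed in
print(channel.received_lines)   # what the decoding base class kept
```

With the real module the mixin would presumably be combined with smtpd.SMTPChannel instead of FakeChannel. Two caveats, both from memory and worth verifying: the hook sees command lines as well as message data, and the underlying asynchat machinery appears to strip the line terminator before calling it, so the terminators would have to be re-added to reconstruct the exact byte stream.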


"A string of ASCII text is also valid UTF-8 text."

It is true. It is a nice property of UTF-8 encoding. If you can decode a byte sequence into Unicode using US-ASCII character encoding then you can also decode the bytes using UTF-8 character encoding (and the resulting Unicode code points are the same in both cases).

smtpd.py should have used either latin1 (it decodes any byte sequence) or ascii (with the 'strict' error handler, to fail on any non-ASCII byte) instead of utf-8 (it allows some non-ASCII bytes, which is bad).
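The difference between the three codecs can be seen on a couple of sample byte strings. The helper try_decode and the sample values are made up for this illustration:

```python
def try_decode(data: bytes, encoding: str):
    """Decode strictly; return None instead of raising on undecodable bytes."""
    try:
        return data.decode(encoding, "strict")
    except UnicodeDecodeError:
        return None


# For pure ASCII input all three codecs agree.
ascii_bytes = bytes(range(128))
same_for_ascii = (
    ascii_bytes.decode("ascii")
    == ascii_bytes.decode("utf-8")
    == ascii_bytes.decode("latin-1")
)

utf8_sample = b"caf\xc3\xa9"    # "café" encoded as UTF-8
latin1_sample = b"caf\xe9"      # "café" encoded as Latin-1

for sample in (utf8_sample, latin1_sample):
    for encoding in ("ascii", "utf-8", "latin-1"):
        print(sample, encoding, try_decode(sample, encoding))

# latin-1 maps every byte value to a code point, so it always round-trips.
all_bytes = bytes(range(256))
latin1_lossless = all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```

Strict ASCII rejects both non-ASCII samples, UTF-8 accepts one and rejects the other, and Latin-1 accepts everything (although the resulting text is only meaningful if the bytes really were Latin-1).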

Keep in mind:

  • some emails may have bytes outside ascii range
  • de-transparency according to RFC 5321 doesn't preserve input bytes as-is even if they are all in ascii range
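To make the second point concrete, here is a simplified sketch of the receiver side of the RFC 5321 transparency procedure (removing the extra leading dot that the sender adds to lines starting with "."). It is an illustration written for this answer, not the code from smtpd.py, and the function name and sample message are hypothetical:

```python
def undo_dot_stuffing(wire_body: bytes) -> bytes:
    """Remove one leading '.' from every line that starts with one."""
    lines = wire_body.split(b"\r\n")
    unstuffed = [line[1:] if line.startswith(b".") else line for line in lines]
    return b"\r\n".join(unstuffed)


# Body as it travels on the wire: the sender doubled the dot on the second line.
wire_body = b"Hello\r\n..a line that starts with a dot\r\nBye"

delivered = undo_dot_stuffing(wire_body)
print(delivered)

# Every byte is ASCII, yet the delivered content differs from the wire bytes.
all_ascii = all(byte < 0x80 for byte in wire_body)
unchanged = (delivered == wire_body)
print(all_ascii, unchanged)
```

As far as I can tell, smtpd.py additionally joins the lines with '\n' rather than '\r\n', which would be a second way in which the content differs from the bytes on the wire.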


Source: https://stackoverflow.com/questions/21615662/python-3-if-a-string-contains-only-ascii-is-it-equal-to-the-string-as-bytes
