Question:
Is there any Python module that helps to decode the various forms of encoded mail headers, mainly Subject, into simple strings (say, UTF-8)?
Here are example Subject headers from mail files that I have:
Subject: [ 201105311136 ]=?UTF-8?B?IMKnIDE2NSBBYnM=?=. 1 AO;
Subject: [ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=
Subject: [ 201105191633 ] =?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?= =?UTF-8?B?Z2VuIGVpbmVzIFNlZW1hbm5z?=
text - encoded string - text
text - encoded string
text - encoded string - encoded string
The encoding could also be something else, such as ISO 8859-15.
Update 1: I forgot to mention, I tried email.header.decode_header
for item in message.items():
    if item[0] == 'Subject':
        sub = email.header.decode_header(item[1])
        logging.debug( 'Subject is %s' % sub )
This outputs
DEBUG:root:Subject is [('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]
which does not really help.
Update 2: Thanks to Ingmar Hupp in the comments.
The first example decodes to a list of two tuples:
print decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
[('[ 201105161048 ] GewSt:', None), (' Wegfall der Vorl\xc3\xa4ufigkeit', 'utf-8')]
Is the result always of the form [(string, encoding), (string, encoding), ...], so that I need a loop to concatenate all the [0] items into one string? Or is there another way to get it all in one string?
Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011
does not decode well:
print decode_header("""[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011""")
[('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]
Answer 1:
This type of encoding is known as MIME encoded-word and the email module can decode it:
from email.header import decode_header
print decode_header("""=?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=""")
This outputs a list of tuples, containing the decoded string and the encoding used. This is because the format supports different encodings in a single header. To merge these into a single string you need to convert them into a shared encoding and then concatenate this, which can be accomplished using Python's unicode object:
from email.header import decode_header
dh = decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
default_charset = 'ASCII'
print ''.join([ unicode(t[0], t[1] or default_charset) for t in dh ])
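The snippet above relies on the Python 2 unicode type. On Python 3, decode_header returns bytes for decoded parts and may return a plain str when the header contains no encoded-word at all. A minimal sketch of the same join for Python 3 follows; the helper name join_decoded_header and the errors="replace" fallback are assumptions made for illustration, not part of the original answer.

```python
from email.header import decode_header


def join_decoded_header(value, default_charset="ascii"):
    """Decode every part of a header value and join the parts into one str."""
    parts = []
    for text, charset in decode_header(value):
        if isinstance(text, bytes):
            # Decoded parts arrive as bytes; parts without a declared
            # charset fall back to default_charset (an assumption).
            text = text.decode(charset or default_charset, errors="replace")
        parts.append(text)
    return "".join(parts)


subject = "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="
result = join_decoded_header(subject)
print(result)
```

Values without any encoded-word pass through unchanged, because decode_header hands them back as a single str part.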
Update 2:
The problem with this Subject line not decoding:
Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011
                                                                     ^
is actually the sender's fault: the header violates the requirement that encoded-words be separated from surrounding text by white space, as specified in RFC 2047, section 5, paragraph 1: an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.
If need be, you can work around this by pre-processing these corrupt headers with a regex that inserts a whitespace after the encoded-word part (unless it's at the end), like so:
import re
header_value = re.sub(r"(=\?.*\?=)(?!$)", r"\1 ", header_value)
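One caveat with the pattern above: .* is greedy, so in a header with several encoded-words it can match from the first =? to the last ?=. A slightly tighter variant is sketched below. It assumes that charset and payload never contain "?" or whitespace, and the names ENCODED_WORD and separate_encoded_words are made up for this example.

```python
import re

# charset and payload may not contain "?" or whitespace; the lookahead
# only fires when the encoded-word is directly followed by a non-space.
ENCODED_WORD = re.compile(r"(=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=)(?=\S)")


def separate_encoded_words(header_value):
    """Insert a space after each encoded-word that runs into following text."""
    return ENCODED_WORD.sub(r"\1 ", header_value)


broken = "[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011"
fixed = separate_encoded_words(broken)
print(fixed)
```

Headers that are already well formed, or that end with an encoded-word, are left untouched because the lookahead requires a following non-space character.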
Answer 2:
I was just testing with encoded headers in Python 3.3, and I found that this is a very convenient way to deal with them:
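The snippet that belonged here did not survive in this copy. A minimal sketch of what it plausibly looked like, assuming it combined decode_header with make_header and bound the result to h (the name used further down):

```python
from email.header import decode_header, make_header

subject = "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="

# decode_header splits the value into (bytes, charset) parts;
# make_header combines those parts into a single Header object.
h = make_header(decode_header(subject))

# str() returns the fully decoded text of the header.
decoded = str(h)
print(decoded)
```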
As you can see it automatically adds whitespace around the encoded words.
It internally keeps the encoded and ASCII header parts separate as you can see when it re-encodes the non-ASCII parts:
>>> h.encode()
'[ 201105161048 ] GewSt: =?utf-8?q?_Wegfall_der_Vorl=C3=A4ufigkeit?='
If you want the whole header re-encoded you could convert the header to a string and then back into a header:
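The code for this step is also missing from this copy. As a sketch under the assumption that it used the Header class directly (the variable names are illustrative):

```python
from email.header import Header, decode_header, make_header

subject = "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="
h = make_header(decode_header(subject))

# Build a new Header from the decoded text, so the whole value is one chunk.
h2 = Header(str(h))

# The text contains a non-ASCII character, so the whole value is expected
# to come out as encoded-word(s) rather than as mixed plain/encoded parts.
encoded = h2.encode()
print(encoded)
```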
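Answer 3:
import email.header

# Python 2: decode each part to unicode, then return one UTF-8 byte string
def decode_header(value):
    return ' '.join((item[0].decode(item[1] or 'utf-8').encode('utf-8') for item in email.header.decode_header(value)))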
Answer 4:
How about decoding headers in the following way:
import poplib, email
from email.header import decode_header, make_header
...
subject, encoding = decode_header(message.get('subject'))[0]
if encoding is None:
    print "\n%s (%s)\n" % (subject, encoding)
else:
    print "\n%s (%s)\n" % (subject.decode(encoding), encoding)
This gets the subject from the email and decodes its first part with the specified encoding (or does no decoding if the encoding is None).
Worked for me for encodings set as 'None', 'utf-8', 'koi8-r', 'cp1251', 'windows-1251'
Answer 5:
This script works fine for me. I use it to decode all email subjects:
import base64
import quopri
import re

# One match = optional leading text, one encoded-word, optional trailing text
pat2 = re.compile(r'(([^=]*)=\?([^\?]*)\?([BbQq])\?([^\?]*)\?=([^=]*))', re.IGNORECASE)

def decodev2(a):
    data = pat2.findall(a)
    if not data:
        return a  # no encoded-word found, return the value unchanged
    line = []
    for g in data:
        (raw, extra1, encoding, method, string, extra) = g
        extra1 = extra1.replace('\r', '').replace('\n', '').strip()
        if len(extra1) > 0:
            line.append(extra1)
        if method.lower() == 'q':
            # header=True decodes "_" as a space, as RFC 2047 requires
            string = quopri.decodestring(string, header=True).strip()
        if method.lower() == 'b':
            string = base64.b64decode(string)
        line.append(string.decode(encoding, errors='ignore'))
        extra = extra.replace('\r', '').replace('\n', '').strip()
        if len(extra) > 0:
            line.append(extra)
    return "".join(line)
Samples:
=?iso-8859-1?q?una-al-dia_=2806/04/2017=29_Google_soluciona_102_vulnerabi?= =?iso-8859-1?q?lidades_en_Android?=
=?UTF-8?Q?Al=C3=A9grate?= : =?UTF-8?Q?=20La=20compra=20de=20tu=20vehi?= =?UTF-8?Q?culo=20en=20tan=20s=C3=B3lo=2024h?= =?UTF-8?Q?=2E=2E=2E=20=C2=A1Valoraci=C3=B3n=20=26?= =?UTF-8?Q?ago=20=C2=A0inmediato=21?=
Answer 6:
Python has an e-mail lib. http://docs.python.org/library/email.header.html
Take a look at email.header.decode_header()
Answer 7:
I had a similar issue, but my case was a little bit different:
- Python 3.5 (the question is from 2011, but it still ranks very high on Google)
- Read message directly from file as byte-string
A nice feature of the Python 3 email.parser is that all headers are automatically decoded to Unicode strings. However, this causes trouble when dealing with malformed headers. The following header caused the problem:
Subject: Re: =?ISO-2022-JP?B?GyRCIVYlMyUiMnE1RCFXGyhC?= (1/9(=?ISO-2022-JP?B?GyRCNmIbKEI=?=) 6:00pm-7:00pm) =?ISO-2022-JP?B?GyRCJE4kKkNOJGkkOxsoQg==?=
This resulted in the following msg['subject']:
Re: 「コア会議」 (1/9(=?ISO-2022-JP?B?GyRCNmIbKEI=?=) 6:00pm-7:00pm) のお知らせ
The issue is non-compliance with RFC 2047 (there should be linear-white-space after the MIME encoded-word), as already described in the answer by Ingmar Hupp. My answer is inspired by his.
Solution 1: Fix byte-string before actually parsing the email. This seemed to be the better solution, however I was struggling to implement a Regex substitution on byte-strings. So I opted for solution 2:
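For reference, Solution 1 could look roughly like the sketch below. It is only an illustration: the pattern assumes that charset and payload contain no "?" or whitespace, the names ENCODED_WORD_BYTES and separate_encoded_words_bytes are made up, and the substitution runs over the whole raw message, so a body that happens to contain something shaped like an encoded-word would be altered as well. Whether the parser then decodes every word can also depend on the Python version.

```python
import email
import re
from email import policy

# Bytes pattern: an encoded-word directly followed by a non-space byte.
ENCODED_WORD_BYTES = re.compile(rb"(=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=)(?=\S)")


def separate_encoded_words_bytes(raw_message):
    """Insert a space after each encoded-word that runs into following text."""
    return ENCODED_WORD_BYTES.sub(rb"\1 ", raw_message)


raw = (
    b"Subject: Re: (1/9(=?ISO-2022-JP?B?GyRCNmIbKEI=?=) 6:00pm-7:00pm)\r\n"
    b"\r\n"
    b"body\r\n"
)
fixed = separate_encoded_words_bytes(raw)
first_line = fixed.splitlines()[0]
print(first_line)

# The repaired bytes can then be handed to the parser as usual.
msg = email.message_from_bytes(fixed, policy=policy.default)
```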
Solution 2: Fix the already parsed and partly-decoded header value:
import email
import re
from email import policy
from email.header import decode_header

def fix_wrong_encoded_words_header(header_value):
    # insert a space after an encoded word that is directly followed by a non-space character
    fixed_header_value = re.sub(r"(=\?.*\?=)(?=\S)", r"\1 ", header_value)
    if fixed_header_value == header_value:
        # nothing needed to fix
        return header_value
    else:
        dh = decode_header(fixed_header_value)
        default_charset = 'unicode-escape'
        correct_header_value = ''.join([str(t[0], t[1] or default_charset) for t in dh])
        return correct_header_value

with open(file, 'rb') as fp:  # read as byte-string
    msg = email.message_from_binary_file(fp, policy=policy.default)
subject_fixed = fix_wrong_encoded_words_header(msg['subject'])
Explanation of important parts:
I modified Ingmar Hupp's regex so that it only matches malformed MIME encoded-words: (=\?.*\?=)(?=\S). Re-decoding every header would heavily slow down the parsing (I was parsing about 150,000 mails).
After applying the decode_header function to fixed_header_value, we have the following parts in dh:
dh == [(b'Re: \\u300c\\u30b3\\u30a2\\u4f1a\\u8b70\\u300d (1/9(', None), (b'\x1b$B6b\x1b(B', 'iso-2022-jp'), (b' ) 6:00pm-7:00pm) \\u306e\\u304a\\u77e5\\u3089\\u305b', None)]
To re-decode the unicode-escaped sequences, we set default_charset = 'unicode-escape' when building the new header value.
The correct_header_value is now:
Re: 「コア会議」 (1/9(金 ) 6:00pm-7:00pm) のお知らせ
I hope this will save somebody some time.
Addition: The answer by Sander Steffann didn't really help me, because I wasn't able to get the raw-value of the header-field out of the message-class.
Answer 8:
For me this worked perfectly (and it always gives me a string):
import email.header
dmsgsubject, dmsgsubjectencoding = email.header.decode_header(msg['Subject'])[0]
msgsubject = dmsgsubject.decode(*([dmsgsubjectencoding] if dmsgsubjectencoding else [])) if isinstance(dmsgsubject, bytes) else dmsgsubject