Python - email header decoding UTF-8

Anonymous (unverified), submitted on 2019-12-03 08:52:47

Question:

Is there any Python module which helps to decode the various forms of encoded mail headers, mainly Subject, into simple (say, UTF-8) strings?

Here are example Subject headers from mail files that I have:

    Subject: [ 201105311136 ]=?UTF-8?B?IMKnIDE2NSBBYnM=?=. 1 AO;
    Subject: [ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=
    Subject: [ 201105191633 ]
      =?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=
      =?UTF-8?B?Z2VuIGVpbmVzIFNlZW1hbm5z?=

text - encoded string - text

text - encoded string

text - encoded string - encoded string

The encoding could also be something else, like ISO 8859-15.

Update 1: I forgot to mention, I tried email.header.decode_header

    for item in message.items():
        if item[0] == 'Subject':
            sub = email.header.decode_header(item[1])
            logging.debug('Subject is %s' % sub)

This outputs

DEBUG:root:Subject is [('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]

which does not really help.

Update 2: Thanks to Ingmar Hupp in the comments.

The first example decodes to a list of two tuples:

print decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
[('[ 201105161048 ] GewSt:', None), (' Wegfall der Vorl\xc3\xa4ufigkeit', 'utf-8')]

Is this always [(string, encoding), (string, encoding), ...], so that I need a loop to concatenate all the [0] items into one string? Or how else do I get it all in one string?

Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011

does not decode well:

print decode_header("""[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011""")

[('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]

Answer 1:

This type of encoding is known as MIME encoded-word and the email module can decode it:

    from email.header import decode_header
    print decode_header("""=?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=""")

This outputs a list of tuples, containing the decoded string and the encoding used. This is because the format supports different encodings in a single header. To merge these into a single string you need to convert them into a shared encoding and then concatenate this, which can be accomplished using Python's unicode object:

    from email.header import decode_header
    dh = decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
    default_charset = 'ASCII'
    print ''.join([ unicode(t[0], t[1] or default_charset) for t in dh ])
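The snippet above is Python 2. On Python 3 there is no `unicode` built-in, and `decode_header` returns `bytes` for decoded parts (or a single `str` when the header contains no encoded-word at all). A minimal sketch of the same join for Python 3 follows; the helper name `decode_mime_header` and the `errors="replace"` choice are assumptions made for illustration, not part of the original answer.

```python
from email.header import decode_header


def decode_mime_header(value, default_charset="ascii"):
    """Return a header value as one str, decoding any RFC 2047 encoded-words."""
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            # Parts without a declared charset fall back to default_charset;
            # undecodable bytes are replaced with U+FFFD instead of raising.
            parts.append(data.decode(charset or default_charset, errors="replace"))
        else:
            # A header without encoded-words comes back as a plain str.
            parts.append(data)
    return "".join(parts)


print(decode_mime_header("=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="))
```

A header that declares a charset name Python does not know would still raise `LookupError` here; catching that is left out to keep the sketch short.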

Update 2:

The problem with this Subject line not decoding:

    Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011
                                                                         ^

is actually the sender's fault: the header violates the requirement that encoded-words be separated from surrounding text by white space, specified in RFC 2047, section 5, paragraph 1: an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.

If need be, you can work around this by pre-processing these corrupt headers with a regex that inserts a space after the encoded-word part (unless it's at the end), like so:

    import re
    header_value = re.sub(r"(=\?.*\?=)(?!$)", r"\1 ", header_value)
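The regex above is greedy (`.*`) and only pads the right-hand side. As a hedged variation, the sketch below matches one encoded-word at a time and pads both sides where white space is missing. The stricter pattern and the helper name `pad_encoded_words` are assumptions of this sketch, not part of the original answer, and the inserted spaces do end up in the decoded text.

```python
import re

# One RFC 2047 encoded-word: =?charset?B-or-Q?payload?=
_WORD = r"=\?[^?\s]+\?[BbQq]\?[^?\s]*\?="
_MISSING_AFTER = re.compile(r"(" + _WORD + r")(?=\S)")
_MISSING_BEFORE = re.compile(r"(?<=\S)(" + _WORD + r")")


def pad_encoded_words(header_value):
    """Insert a space wherever an encoded-word touches neighbouring text."""
    padded = _MISSING_AFTER.sub(r"\1 ", header_value)
    return _MISSING_BEFORE.sub(r" \1", padded)


broken = "[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011"
print(pad_encoded_words(broken))
```

A header that already has white space around its encoded-words passes through unchanged.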


Answer 2:

I was just testing with encoded headers in Python 3.3, and I found that this is a very convenient way to deal with them:
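The snippet that belonged here did not survive in this copy. Judging from the `h.encode()` call shown further down, it presumably combined `decode_header` with `make_header`; the following is a reconstruction under that assumption, not the author's verbatim code.

```python
from email.header import decode_header, make_header

subject = "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="

# decode_header() splits the value into (bytes, charset) chunks;
# make_header() wraps those chunks in a Header object.
h = make_header(decode_header(subject))

# str() on the Header yields the fully decoded text.
print(str(h))
```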

As you can see it automatically adds whitespace around the encoded words.

It internally keeps the encoded and ASCII header parts separate as you can see when it re-encodes the non-ASCII parts:

    >>> h.encode()
    '[ 201105161048 ] GewSt: =?utf-8?q?_Wegfall_der_Vorl=C3=A4ufigkeit?='

If you want the whole header re-encoded you could convert the header to a string and then back into a header:
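The code for this step is also missing from this copy. A minimal sketch of the round trip described in the sentence above might look as follows; passing `"utf-8"` explicitly to `Header` is an assumption of this sketch.

```python
from email.header import Header, decode_header, make_header

subject = "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="
h = make_header(decode_header(subject))

# Flatten to one str, then build a new Header holding a single UTF-8 chunk.
h2 = Header(str(h), "utf-8")

# encode() now emits the whole value as encoded-word(s),
# not just the non-ASCII part.
encoded = h2.encode()
print(encoded)
```

Decoding `encoded` again with `make_header(decode_header(encoded))` should give back the same readable text.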



Answer 3:

    import email.header

    def decode_header(value):  # Python 2: str.decode() gives unicode, re-encoded as UTF-8
        return ' '.join(item[0].decode(item[1] or 'utf-8').encode('utf-8')
                        for item in email.header.decode_header(value))


Answer 4:

How about decoding headers in the following way:

    import poplib, email
    from email.header import decode_header, make_header

    ...

    subject, encoding = decode_header(message.get('subject'))[0]
    if encoding is None:
        print "\n%s (%s)\n" % (subject, encoding)
    else:
        print "\n%s (%s)\n" % (subject.decode(encoding), encoding)

This gets the subject from the email and decodes it with the specified encoding (or does no decoding if the encoding is None). Note that it only looks at the first part ([0]) returned by decode_header.

Worked for me for encodings set as 'None', 'utf-8', 'koi8-r', 'cp1251', 'windows-1251'



Answer 5:

This script works fine for me. I use it to decode all email subjects:

    import base64
    import quopri
    import re

    pat2 = re.compile(r'(([^=]*)=\?([^\?]*)\?([BbQq])\?([^\?]*)\?=([^=]*))', re.IGNORECASE)

    def decodev2(a):
        data = pat2.findall(a)
        line = []
        if data:
            for g in data:
                (raw, extra1, encoding, method, string, extra) = g
                # plain text in front of the encoded-word
                extra1 = extra1.replace('\r', '').replace('\n', '').strip()
                if len(extra1) > 0:
                    line.append(extra1)
                if method.lower() == 'q':
                    # header=True also maps "_" to a space, as Q-encoding requires
                    string = quopri.decodestring(string, header=True).strip()
                if method.lower() == 'b':
                    string = base64.b64decode(string)
                line.append(string.decode(encoding, errors='ignore'))
                # plain text after the encoded-word
                extra = extra.replace('\r', '').replace('\n', '').strip()
                if len(extra) > 0:
                    line.append(extra)
            return "".join(line)
        else:
            return a

Samples:

    =?iso-8859-1?q?una-al-dia_=2806/04/2017=29_Google_soluciona_102_vulnerabi?=
     =?iso-8859-1?q?lidades_en_Android?=

    =?UTF-8?Q?Al=C3=A9grate?= : =?UTF-8?Q?=20La=20compra=20de=20tu=20vehi?= =?UTF-8?Q?culo=20en=20tan=20s=C3=B3lo=2024h?= =?UTF-8?Q?=2E=2E=2E=20=C2=A1Valoraci=C3=B3n=20=26?= =?UTF-8?Q?ago=20=C2=A0inmediato=21?=


Answer 6:

Python has an e-mail lib. http://docs.python.org/library/email.header.html

Take a look at email.header.decode_header()



Answer 7:

I had a similar issue, but my case was a little bit different:

  • Python 3.5 (the question is from 2011, but it still ranks very high on Google)
  • Read message directly from file as byte-string

Now the cool feature of the Python 3 email.parser is that all headers are automatically decoded to Unicode strings. However, this causes a little "misfortune" when dealing with malformed headers. The following header caused the problem:

    Subject: Re: =?ISO-2022-JP?B?GyRCIVYlMyUiMnE1RCFXGyhC?=
     (1/9(=?ISO-2022-JP?B?GyRCNmIbKEI=?=) 6:00pm-7:00pm)
     =?ISO-2022-JP?B?GyRCJE4kKkNOJGkkOxsoQg==?=

This resulted in the following msg['subject']:

Re: 「コア会議」 (1/9(=?ISO-2022-JP?B?GyRCNmIbKEI=?=) 6:00pm-7:00pm)  のお知らせ

The issue is non-compliance with RFC 2047 (there should be linear-white-space after the MIME encoded-word), as already described in the answer by Ingmar Hupp. So my answer is inspired by his.

Solution 1: Fix byte-string before actually parsing the email. This seemed to be the better solution, however I was struggling to implement a Regex substitution on byte-strings. So I opted for solution 2:
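For reference, a regex substitution on byte-strings mostly comes down to using a bytes pattern and a bytes replacement (`rb"..."`). The sketch below illustrates solution 1 under that assumption; the narrowed single-word pattern and the helper name `fix_raw_message` are hypothetical, and as written it would also touch matching text in the message body, not only the headers.

```python
import re

# Same idea as the str regex, but pattern and replacement are bytes.
# The pattern is narrowed to a single encoded-word (an assumption of this sketch).
_ENCODED_WORD_THEN_TEXT = re.compile(rb"(=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=)(?=\S)")


def fix_raw_message(raw_bytes):
    """Insert a space after encoded-words that run straight into other text."""
    return _ENCODED_WORD_THEN_TEXT.sub(rb"\1 ", raw_bytes)


raw = b"Subject: (1/9(=?ISO-2022-JP?B?GyRCNmIbKEI=?=) 6:00pm\r\n\r\nbody\r\n"
print(fix_raw_message(raw))
```

The fixed bytes could then be handed to `email.message_from_bytes` instead of parsing the file directly.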

Solution 2: Fix the already parsed and partly-decoded header value:

    import email
    import re
    from email import policy
    from email.header import decode_header


    def fix_wrong_encoded_words_header(header_value):
        # add a space after an encoded-word that is directly followed by text
        fixed_header_value = re.sub(r"(=\?.*\?=)(?=\S)", r"\1 ", header_value)

        if fixed_header_value == header_value:  # nothing needed to fix
            return header_value
        else:
            dh = decode_header(fixed_header_value)
            default_charset = 'unicode-escape'
            correct_header_value = ''.join([str(t[0], t[1] or default_charset) for t in dh])
            return correct_header_value


    with open(file, 'rb') as fp:  # read as byte-string
        msg = email.message_from_binary_file(fp, policy=policy.default)
        subject_fixed = fix_wrong_encoded_words_header(msg['subject'])

Explanation of important parts:

I modified Ingmar Hupp's regex so that it only touches encoded-words that are directly followed by a non-space character: (=\?.*\?=)(?=\S). Applying the fix to every header would slow the parsing down considerably (I was parsing about 150,000 mails).

After applying the decode_header function to the fixed header, we have the following parts in dh:

    dh == [(b'Re: \\u300c\\u30b3\\u30a2\\u4f1a\\u8b70\\u300d (1/9(', None),
           (b'\x1b$B6b\x1b(B', 'iso-2022-jp'),
           (b' ) 6:00pm-7:00pm)  \\u306e\\u304a\\u77e5\\u3089\\u305b', None)]

To re-decode the unicode-escaped sequences, we set default_charset = 'unicode-escape' when building the new header-value.
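A small illustration of that step, using a shortened copy of the first chunk. In CPython, decode_header appears to produce these bytes with the 'raw-unicode-escape' codec, so that codec may be the closer inverse; treat that as an observation about the implementation rather than a documented guarantee.

```python
# Bytes as they appear in the first chunk of dh (non-ASCII text shown as \uXXXX).
chunk = b"Re: \\u300c\\u30b3\\u30a2"

# The 'unicode-escape' codec turns the literal \uXXXX sequences back into characters.
text = chunk.decode("unicode-escape")
print(text)

# 'raw-unicode-escape' gives the same result here, but it would leave other
# backslash sequences (such as a literal "\n" in the header text) untouched.
assert text == chunk.decode("raw-unicode-escape")
```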

The correct_header_value is now:

Re: 「コア会議」 (1/9(金 ) 6:00pm-7:00pm)  のお知らせ

I hope this will save somebody some time.

Addition: The answer by Sander Steffann didn't really help me, because I wasn't able to get the raw-value of the header-field out of the message-class.



Answer 8:

For me this worked perfectly (and always gives me a string):

    dmsgsubject, dmsgsubjectencoding = email.header.decode_header(msg['Subject'])[0]
    msgsubject = (
        dmsgsubject.decode(*([dmsgsubjectencoding] if dmsgsubjectencoding else []))
        if isinstance(dmsgsubject, bytes)
        else dmsgsubject
    )

