I have many emails coming in from different sources. they all have attachments, many of them have attachment names in chinese, so these names are converted to base64 by thei
Please note both
Content-Transfer-Encodinghave base64
Not relevant in this case, the Content-Transfer-Encoding only applies to the body payload, not to the headers.
=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=
That's an RFC2047-encoded header atom. The stdlib function to decode it is email.header.decode_header. It still needs a little post-processing to interpret the outcome of that function though:
import email.header
x= '=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?='
try:
name= u''.join([
unicode(b, e or 'ascii') for b, e in email.header.decode_header(x)
])
except email.Errors.HeaderParseError:
pass # leave name as it was
However...
Content-Type: application/vnd.ms-excel;
name="=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?="
This is simply wrong. What mailer created it? RFC2047 encoding can only happen in atoms, and a quoted-string is not an atom. RFC2047 §5 explicitly denies this:
- An 'encoded-word' MUST NOT appear within a 'quoted-string'.
The accepted way to encode parameter headers when long string or Unicode characters are present is RFC2231, which is a whole new bag of hurt. But you should be using a standard mail-parsing library which will cope with that for you.
So, you could detect the '=?' in filename parameters if you want, and try to decode it via RFC2047. However, the strictly-speaking-correct thing to do is to take the mailer at its word and really call the file =?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=!