How to tell if a string is base64 or not

春和景丽 2021-01-03 01:10

I have many emails coming in from different sources. They all have attachments, and many of them have attachment names in Chinese, so these names are converted to base64 by thei

6 answers
  •  臣服心动
    2021-01-03 02:06

    Please note both Content-Transfer-Encoding have base64

    Not relevant in this case: the Content-Transfer-Encoding only applies to the body payload, not to the headers.

    =?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=
    

    That's an RFC2047-encoded header atom. The stdlib function to decode it is email.header.decode_header. It still needs a little post-processing to interpret the outcome of that function though:

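    import email.errors
    import email.header

    x = '=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?='
    name = x  # fall back to the raw value if decoding fails
    try:
        # decode_header yields (chunk, charset) pairs; on Python 3 a chunk
        # may be bytes (decode it with its charset) or already a str
        name = ''.join(
            b.decode(e or 'ascii') if isinstance(b, bytes) else b
            for b, e in email.header.decode_header(x)
        )
    except (email.errors.HeaderParseError, LookupError, UnicodeDecodeError):
        pass  # leave name as it was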
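
    To see what the stdlib function is doing under the hood, here is a rough Python 3 sketch that takes one encoded-word apart by hand. The helper name split_encoded_word is made up for illustration, and it only handles a single B (Base64) word; for real mail, decode_header is the safer choice.

```python
import base64


def split_encoded_word(word):
    """Split '=?charset?B?payload?=' into (charset, decoded text)."""
    # Hypothetical helper: handles only a single B-encoded (Base64) word.
    if not (word.startswith('=?') and word.endswith('?=')):
        raise ValueError('not an RFC 2047 encoded-word')
    # Between the '=?' and '?=' markers there are three '?'-separated fields.
    charset, encoding, payload = word[2:-2].split('?')
    if encoding.upper() != 'B':
        raise ValueError('only Base64 (B) encoding is handled here')
    # Base64 gives raw bytes; the declared charset turns them into text.
    return charset, base64.b64decode(payload).decode(charset)


charset, text = split_encoded_word('=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=')
print(charset, text)
```

    So the way to tell that the string is base64 is the ?B? field in the wrapper, not anything about the payload itself.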
    

    However...

    Content-Type: application/vnd.ms-excel;
     name="=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?="
    

    This is simply wrong. What mailer created it? RFC 2047 encoding can only happen in atoms, and a quoted-string is not an atom. RFC 2047 §5 explicitly forbids this:

    • An 'encoded-word' MUST NOT appear within a 'quoted-string'.

    The accepted way to encode header parameters when long strings or non-ASCII characters are present is RFC 2231, which is a whole new bag of hurt. But you should be using a standard mail-parsing library, which will cope with that for you.
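
    As a sketch of what "cope with that for you" can look like in Python 3: the stdlib email package decodes RFC 2231 parameters when you ask for the filename. The message below is a made-up example (it is not taken from the question), with the filename written in RFC 2231 form, charset'language'percent-encoded-bytes.

```python
import email

# Hypothetical message: the filename parameter uses RFC 2231 syntax
# (charset'language'percent-encoded-bytes) instead of an encoded-word.
raw = (
    "Content-Type: application/vnd.ms-excel\n"
    "Content-Disposition: attachment; filename*=gb2312''%B8%B1%B1%BE.xls\n"
    "\n"
    "payload\n"
)

msg = email.message_from_string(raw)
# get_filename() reads the Content-Disposition filename parameter and
# decodes an RFC 2231 value into a str.
filename = msg.get_filename()
print(filename)
```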

    So, you could detect the '=?' in filename parameters if you want, and try to decode it via RFC2047. However, the strictly-speaking-correct thing to do is to take the mailer at its word and really call the file =?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=!
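
    If you do go the lenient route, it might look roughly like this in Python 3. The names ENCODED_WORD and lenient_filename are hypothetical, and the regular expression is only an approximation of the RFC 2047 grammar; anything that does not look like an encoded-word, or fails to decode, is returned untouched.

```python
import email.errors
import email.header
import re

# Approximate pattern for one RFC 2047 encoded-word: =?charset?B-or-Q?text?=
ENCODED_WORD = re.compile(r'=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=')


def lenient_filename(raw):
    """Decode RFC 2047 encoded-words in a filename parameter, if any."""
    if not ENCODED_WORD.search(raw):
        return raw  # nothing that looks encoded: take the name at its word
    try:
        return ''.join(
            chunk.decode(charset or 'ascii') if isinstance(chunk, bytes) else chunk
            for chunk, charset in email.header.decode_header(raw)
        )
    except (email.errors.HeaderParseError, LookupError, UnicodeDecodeError):
        return raw  # bad payload or unknown charset: keep the raw value


print(lenient_filename('=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?='))
print(lenient_filename('report.xls'))
```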
