how to tell if a string is base64 or not

前端未结

关注

 6  552

春和景丽 2021-01-03 01:10

I have many emails coming in from different sources. they all have attachments, many of them have attachment names in chinese, so these names are converted to base64 by thei

6条回答

臣服心动 (楼主)

2021-01-03 02:06
Please note both Content-Transfer-Encoding have base64

Not relevant in this case, the Content-Transfer-Encoding only applies to the body payload, not to the headers.
```
=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=
```
That's an RFC2047-encoded header atom. The stdlib function to decode it is email.header.decode_header. It still needs a little post-processing to interpret the outcome of that function though:
```
import email.header
x= '=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?='
try:
    name= u''.join([
        unicode(b, e or 'ascii') for b, e in email.header.decode_header(x)
    ])
except email.Errors.HeaderParseError:
    pass # leave name as it was
```
However...
```
Content-Type: application/vnd.ms-excel;
 name="=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?="
```
This is simply wrong. What mailer created it? RFC2047 encoding can only happen in atoms, and a quoted-string is not an atom. RFC2047 §5 explicitly denies this:
- An 'encoded-word' MUST NOT appear within a 'quoted-string'.
The accepted way to encode parameter headers when long string or Unicode characters are present is RFC2231, which is a whole new bag of hurt. But you should be using a standard mail-parsing library which will cope with that for you.

So, you could detect the '=?' in filename parameters if you want, and try to decode it via RFC2047. However, the strictly-speaking-correct thing to do is to take the mailer at its word and really call the file =?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=!
0 讨论(0)

查看其它6个回答
发布评论:

提交评论
- 加载中...