Is there a way to get the attachment names from a PST file?

Submitted by 独自空忆成欢 on 2020-01-16 14:35:16

Question


I'm working on a Python script that uses pypff to open Outlook PST files and extract useful information. I'm following the code posted here.

I'm trying to get the names of the attachments for each email, but the only methods on the 'attachment' type are get_size(), read_buffer() and seek_offset(), which aren't useful to me.

The read_buffer method returns a long byte string, something like \x00\x11\x00\x02\x01\x02\x02\x01\x03\x04\x07\x05\...

How can I decode it?


Answer 1:


You can try decoding with ascii first.

attach_size = msg.get_attachment(0).get_size()
print(msg.get_attachment(0).read_buffer(attach_size).decode('ascii', errors="ignore"))

I think Microsoft uses more than one encoding for different parts of attachments, so no single codec decodes everything perfectly. If ascii does not decode enough of the content, you can try all of them. The list of available encodings differs between Python versions; check it out here.

# 98 encodings in python3.5/6/7
decode = ['ascii','big5','big5hkscs','cp037','cp273',
          'cp424','cp437','cp500','cp720','cp737',
          'cp775','cp850','cp852','cp855','cp856',
          'cp857','cp858','cp860','cp861','cp862',
          'cp863','cp864','cp865','cp866','cp869',
          'cp874','cp875','cp932','cp949','cp950',
          'cp1006','cp1026','cp1125','cp1140','cp1250',
          'cp1251','cp1252','cp1253','cp1254','cp1255',
          'cp1256','cp1257','cp1258','cp65001','euc_jp',
          'euc_jis_2004','euc_jisx0213','euc_kr','gb2312','gbk',
          'gb18030','hz','iso2022_jp','iso2022_jp_1','iso2022_jp_2',
          'iso2022_jp_2004','iso2022_jp_3','iso2022_jp_ext','iso2022_kr','latin_1',
          'iso8859_2','iso8859_3','iso8859_4','iso8859_5','iso8859_6',
          'iso8859_7','iso8859_8','iso8859_9','iso8859_10','iso8859_11',
          'iso8859_13','iso8859_14','iso8859_15','iso8859_16','johab',
          'koi8_r','koi8_t','koi8_u','kz1048','mac_cyrillic',
          'mac_greek','mac_iceland','mac_latin2','mac_roman','mac_turkish',
          'ptcp154','shift_jis','shift_jis_2004','shift_jisx0213','utf_32',
          'utf_32_be','utf_32_le','utf_16','utf_16_be','utf_16_le',
          'utf_7','utf_8','utf_8_sig']

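# Select the decoders that recover a known string
# Read the attachment bytes once and reuse them for every codec.
attachment = msg.get_attachment(0)
data = attachment.read_buffer(attachment.get_size())

items = []
for item in decode:
    try:
        content = data.decode(item, errors="ignore")
    except (LookupError, UnicodeError):
        # Skip codecs that are unavailable on this platform or that still fail.
        continue

    # I know 'sample_content' is in the attachment, so it's easy to see which ones can decode it.
    if 'sample_content' in content:
        items.append(item)

print(items)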

If you don't know what's in the content, you can try workarounds. For instance, in the loop you can pick the decoding that leaves the fewest "\x" escapes, since before decoding your content looks like this: "\x93\x93\xfa\x8c\xd3\x1a\xc6".
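
A rough sketch of that workaround, not a definitive implementation: it scores each candidate codec by the share of decoded characters that are printable, which is roughly what "fewest \x escapes" measures. The helper names score_decoding and rank_encodings are made up for illustration, and the sample bytes stand in for whatever read_buffer() returns. Because errors="ignore" silently drops bytes it cannot decode, a high score only suggests a codec, it does not prove it is the right one.

```python
def score_decoding(data, encoding):
    """Return the fraction of decoded characters that are printable,
    or None if the codec is unavailable or cannot decode the data."""
    try:
        text = data.decode(encoding, errors="ignore")
    except (LookupError, UnicodeError):
        return None
    if not text:
        return 0.0
    # str.isprintable() is False for characters repr() would escape,
    # such as "\x00"; common whitespace is allowed explicitly.
    readable = sum(1 for ch in text if ch.isprintable() or ch in "\t\r\n")
    return readable / len(text)


def rank_encodings(data, encodings):
    """Return (encoding, score) pairs, most readable output first."""
    scored = []
    for name in encodings:
        score = score_decoding(data, name)
        if score is not None:
            scored.append((name, score))
    # The sort is stable, so ties keep the order of the input list.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


if __name__ == "__main__":
    # Hypothetical sample standing in for the attachment buffer.
    sample = "hello attachment".encode("utf_16_le")
    for name, score in rank_encodings(sample, ["ascii", "utf_16_le", "utf_8"]):
        print(name, round(score, 2))
```

In the answer's loop you would pass the attachment bytes and the decode list instead of the sample, then inspect the top few candidates by eye.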

If anyone has better ways of decoding attachments, please leave a comment here, thank you.



Source: https://stackoverflow.com/questions/55847711/is-there-a-way-to-get-the-attachment-names-from-a-pst-file
