Extract text from “content-disposition: attachment” body part

Submitted by 蓝咒 on 2020-01-07 06:15:23

Question


I regularly receive a generated email message containing a text part and a text attachment. I want to test whether the attachment is base64 encoded and, if so, decode it like this:

:0B
* ^(Content-Transfer-Encoding: *base64(($)[a-z0-9].*)*($))
{
 msgID=`printf '%s' "$MATCH" | base64 -d`
}

But it always says "invalid input". Does anyone know what's wrong?

procmail: Match on "^()\/[a-z]+[0-9]+[^\+]"
procmail: Assigning "msgID=PGh0b"
procmail: matched "^(Content-Disposition: *attachment.*(($)[a-z0-9].*)*    |Content-Transfer-Encoding: *base64(($)[a-z0-9].*)*($)"

procmail: Executing "printf '%s' "$MATCH" | base64 -d"
base64: invalid input
procmail: Assigning "msgID=<ht"
procmail: Unexpected EOL


procmail: Assigning "msgID=PGh0b"
procmail: Match on "^(Content-Transfer-Encoding: *base64(($)[a-z0-9].*)*($))"
procmail: Executing "printf '%s' "$MATCH" | base64 -d"
base64: invalid input
procmail: Assigning "msgID=<ht"
procmail: Unexpected EOL

Answer 1:


If your requirements are complex, it might be easier to write a dedicated script which extracts the information you want. A modern scripting language with proper MIME support is going to be a lot more versatile when it comes to the myriad possibilities for content encoding and body part structure in modern MIME email.
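As a rough illustration of the dedicated-script route, here is a minimal sketch using Python's standard email package. The function name first_attachment_text is hypothetical, and the sketch assumes the message arrives as raw bytes and that the first attachment is the one you care about.

```python
import email
from email import policy


def first_attachment_text(raw_bytes):
    """Return the decoded content of the first attachment part, or None.

    Hypothetical helper for illustration; assumes the first attachment
    is the interesting one.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in msg.iter_attachments():
        # get_content() undoes the Content-Transfer-Encoding
        # (base64, quoted-printable, ...) for us.
        content = part.get_content()
        if isinstance(content, bytes):
            # Non-text parts come back as bytes; decode them leniently.
            charset = part.get_content_charset() or "utf-8"
            content = content.decode(charset, errors="replace")
        return content
    return None
```

Because the library handles the transfer encoding, the script does not need to test for base64 itself; the same call works for quoted-printable or unencoded parts.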

The Procmail recipe further down finds the first occurrence of MIME headers with Content-Disposition: attachment and extracts the first token of the body that follows. This might do what you want if you are corresponding with a sender who uses a well-defined, static template. There is no real MIME parsing here, so (say) a forwarded message which happens to contain an embedded part matching the pattern will also trigger the condition. (This can be a bug, or a feature.)

A useful but underused feature of Procmail is the ability to write a regular expression which spans multiple lines. Within a regex, ($) always matches a literal newline. So with that, we can look for a Content-Disposition: attachment header, followed by zero or more other headers, followed by an empty line, followed by the token you want to extract. The \/ operator marks where the capture begins: whatever matches to its right is assigned to MATCH.

:0B
* ^Content-Disposition: *attachment.*(($)[A-Z].*)*($)($)\/[A-Z]+[0-9]+
{ msgid="$MATCH" }
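To make the shape of that condition easier to see, here is a rough Python translation of the same pattern. This is only a sketch: the names PATTERN and extract_token are hypothetical, the input is assumed to use bare LF line endings, and Python's regex engine is not identical to Procmail's, so treat it as an approximation of what the recipe matches.

```python
import re

# Procmail conditions are case-insensitive and "^" anchors at any line start,
# hence IGNORECASE and MULTILINE here.
PATTERN = re.compile(
    r"^Content-Disposition: *attachment.*"  # header that opens the MIME part
    r"(?:\n[A-Z].*)*"                       # zero or more further header lines
    r"\n\n"                                 # empty line ending the headers
    r"([A-Z]+[0-9]+)",                      # token (what Procmail puts in MATCH)
    re.IGNORECASE | re.MULTILINE,
)


def extract_token(body):
    """Return the first letters-then-digits token after the headers, or None."""
    m = PATTERN.search(body)
    return m.group(1) if m else None
```

Note that the token pattern stops at the first character after the run of digits, so it only suits identifiers of the form letters followed by digits.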

For simplicity, I have not attempted to cope with multi-line MIME headers. If you want to support that, the fix should be reasonably obvious, though not at all elegant.

In the somewhat more general case, you might want to add a condition to check that the group of MIME headers in the condition also contains a Content-type: text/plain; you will need to set up two alternatives for having Content-type: before or after Content-disposition: (or somehow normalize the MIME headers before getting to this recipe; or trust that the sender always generates them in exactly the order in the sample message).



Source: https://stackoverflow.com/questions/32292295/extract-text-from-content-disposition-attachment-body-part
