Where to find the body of an email depending on its mimeType

小蘑菇 2020-12-01 09:53

I am making a request to the Gmail API's Users.messages endpoint. All objects returned (the emails) have a mimeType property, which I'm struggling to understand.

More specifically…

5 answers
  • 2020-12-01 10:17

    I know this question is not new, but I've written a PHP script that correctly parses messages pulled from the Gmail API, including any type of attachment.

    The script includes a recursive "iterateParts" function that walks every message part, so we can be sure we have extracted all available data from each message.

    The script's steps are:

    1. Pull all message IDs from the API
    2. Get some important headers (subject and "from" address)
    3. Either read the body directly from the payload or pass the payload to iterateParts
    4. iterateParts parses each message part into $msgArr, keeping its data base64-encoded
    5. Push $msgArr onto the master array $allmsgArr
    6. Traverse the master array and save each part as a file according to its MIME type and filename
    
        $maxToPull = 1;
        $gmailQuery = "ALL";
    
        // Initializing Google API
        $service = new Google_Service_Gmail($client);
    
        // Pulling all gmail messages into $messages array
        $user = 'me';
        $msglist = $service->users_messages->listUsersMessages($user, ["maxResults"=>$maxToPull, "q"=>$gmailQuery]);
        $messages = $msglist->getMessages();
    
        // Master array that will hold all parsed messages data, including attachments
        $allmsgArr = array();
    
        // Traverse each message
        foreach($messages as $message)
        {
            $msgArr = array();
            $single_message = $service->users_messages->get('me', $message->getId());
            $payload = $single_message->getPayload();
    
            // Nice to have the gmail msg id, can be used to direct access the message in Gmail's web gui
            $msgArr['gmailmsgid'] = $message->getId();
    
            // Retrieving the subject and "from" email address
            foreach($payload->getheaders() as $oneheader)
            {
                if($oneheader['name'] == 'Subject')
                    $msgArr['subject'] = $oneheader['value'];
                if($oneheader['name'] == 'From')
                    $msgArr['fromaddress'] = substr($oneheader['value'], strpos($oneheader['value'], '<')+1, -1);
            }
    
            // If the body is directly in the message payload (single-part messages with no attachments; normally this is not the case),
            // store it under the key that matches the payload's MIME type
            if($payload['body']['size'] > 0)
                $msgArr[$payload['mimeType'] == 'text/html' ? 'texthtml' : 'textplain'] = $payload['body']['data'];
            // Else, iterate over each message part and continue to dig if necessary
            else
                iterateParts($payload, $message->getId());
    
            // Push the parsed $msgArr (parsed by iterateParts) to master array
            array_push($allmsgArr, $msgArr);
        }
    
    
        // Traverse each parsed message and save its content and attachments to files
        foreach($allmsgArr as $onemsgArr)
        {
    
            $folder = "messages/".$onemsgArr['gmailmsgid'];
            // Create the folder (and "messages/" itself) if it does not exist yet
            if(!is_dir($folder))
                mkdir($folder, 0777, true);
    
            // !empty() avoids "undefined index" notices for parts a message does not have
            if(!empty($onemsgArr['textplain']))
                file_put_contents($folder."/textplain.txt", decodeData($onemsgArr['textplain']));
            if(!empty($onemsgArr['texthtml']))
                file_put_contents($folder."/texthtml.html", decodeData($onemsgArr['texthtml']));
            if(!empty($onemsgArr['attachments']))
            {
                foreach($onemsgArr['attachments'] as $oneattachment)
                {
                    if(!empty($oneattachment['filename']))
                        $filename = basename($oneattachment['filename']); // strip any path components from the sender-supplied name
                    else if($oneattachment['mimetype'] == "message/rfc822") // attached emails
                        $filename = "noname.eml";
                    else
                        $filename = "unknown";
                    file_put_contents($folder."/".$filename, decodeData($oneattachment['data']));
                }
            }
        }
    
    
        function iterateParts($obj, $msgid) {
    
            global $msgArr;
            global $service;
            foreach($obj as $parts)
            {
                // if found body data
                if($parts['body']['size'] > 0)
                {
                    // plain text representation of message body
                    if($parts['mimeType'] == 'text/plain')
                    {
                        $msgArr['textplain'] = $parts['body']['data'];
                    }
                    // html representation of message body
                    else if($parts['mimeType'] == 'text/html')
                    {
                        $msgArr['texthtml'] = $parts['body']['data'];
                    }
                    // if it's an attachment
                    else if(!empty($parts['body']['attachmentId']))
                    {
                        $attachArr = array(); // start fresh so fields from a previous attachment cannot leak in
                        $attachArr['mimetype'] = $parts['mimeType'];
                        $attachArr['filename'] = $parts['filename'];
                        $attachArr['attachmentId'] = $parts['body']['attachmentId'];
    
                        // the message holds only the attachment id; retrieve its data from users_messages_attachments
                        $attachmentId_base64 = $parts['body']['attachmentId'];
                        $single_attachment = $service->users_messages_attachments->get('me', $msgid, $attachmentId_base64);
    
                        $attachArr['data'] = $single_attachment->getData();
    
                        $msgArr['attachments'][] = $attachArr;
                    }       
                }
    
                // if there are other parts inside, go get them
                if(!empty($parts['parts']) && !empty($parts['mimeType']) && empty($parts['body']['attachmentId']))
                {
                    iterateParts($parts->getParts(), $msgid);
                }
    
            }
        }
    
        // Body data returned from the API is base64url-encoded ('-' and '_' instead of '+' and '/'),
        // so map those characters back before decoding
        function decodeData($data)
        {
            $sanitizedData = strtr($data,'-_', '+/');
            return base64_decode($sanitizedData);
        }
    
    

    This is what $allmsgArr will look like (when only one message was pulled):

    
    Array
    (
        [0] => Array
            (
                [gmailmsgid] => 25k1asfa556x2da
                [fromaddress] => john@gmail.com
                [subject] => Fwd: Sea gulls picture
                [textplain] => UE5SIDQxQzAwMg0KDQpBUkJFTFRFU1QxDQoNCg0K
                [texthtml] => PGRpdiBkaXI9Imx0ciI-PHNwYW4gc3R5bGU9ImZi
                [attachments] => Array
                    (
                        [0] => Array
                            (
                                [mimetype] => image/png
                                [filename] => sea_gulls.png
                                [attachmentId] => ANGjdJ9tmy4d8vPXhU_BjNEFEaDODOpu29W2u5OTM7a0
                                [data] => iVBORw0KGgoAAAANSUhEUgAABSYAAAKWCAYAAABUP
                            )
    
                        [1] => Array
                            (
                                [mimetype] => image/jpeg
                                [filename] => Outlook_Signature.jpg
                                [attachmentId] => ANGjdJ-CgZTK0oK44Q8j7TlN_JlaexxGKZ_wHFfoEB
                                [data] => 6jRXhpZgAATU0AKgAAAAgABwESAAMAAAABAAEAAAEa
                            )
    
                    )
            )
    )
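
The decodeData step above can be sketched in JavaScript (Node.js) as follows. This is only an illustration under assumptions: the helper names base64UrlToBuffer and base64UrlToText are made up for this example and are not part of any Gmail client library.

```javascript
// Hypothetical helpers: decode the base64url strings found in Gmail API `body.data` fields.
function base64UrlToBuffer(data) {
  // Map the URL-safe alphabet back to the standard base64 alphabet.
  let b64 = data.replace(/-/g, '+').replace(/_/g, '/');
  // Restore '=' padding in case it was stripped.
  while (b64.length % 4 !== 0) b64 += '=';
  return Buffer.from(b64, 'base64');
}

// Text parts are decoded as UTF-8 here; other charsets would need extra handling.
function base64UrlToText(data) {
  return base64UrlToBuffer(data).toString('utf8');
}

console.log(base64UrlToText('PGI-')); // <b>
```

For attachments, the Buffer returned by base64UrlToBuffer can be written to disk as-is, since the content may be binary.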
    
    
  • 2020-12-01 10:18

    I think it will make sense if you think of the payload as a part in and of itself. Let's say I send a message with just a subject and a plain-text body:

    From: emtholin@gmail.com
    To: emtholin@gmail.com
    Subject: Example Subject
    
    This is the plain text message
    

    This will result in the following parsed message:

    {
     "id": "154ecb53c10b74d8",
     "threadId": "154ecb53c10b74d8",
     "labelIds": [
      "INBOX",
      "SENT"
     ],
     "snippet": "This is the plain text message",
     "historyId": "38877",
     "internalDate": "1464260181000",
     "payload": {
      "partId": "",
      "mimeType": "text/plain",
      "filename": "",
      "headers": [
       ...
      ],
      "body": {
       "size": 31,
       "data": "VGhpcyBpcyB0aGUgcGxhaW4gdGV4dCBtZXNzYWdlCg=="
      }
     },
     "sizeEstimate": 355
    }
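
For a single-part message like the one above, the body can be read straight from the payload. A minimal sketch in Node.js, assuming the message has already been fetched; readSinglePartBody is a hypothetical helper name, and the object below is a trimmed copy of the response shown above:

```javascript
// Hypothetical helper: returns the decoded body of a single-part message,
// or null when the payload carries no inline data (multipart container or attachment).
function readSinglePartBody(message) {
  const { mimeType, body } = message.payload;
  if (!body || !body.data) return null;
  // body.data is base64url-encoded; map it to standard base64 before decoding.
  const b64 = body.data.replace(/-/g, '+').replace(/_/g, '/');
  return { mimeType, text: Buffer.from(b64, 'base64').toString('utf8') };
}

// Trimmed copy of the parsed message shown above.
const message = {
  payload: {
    mimeType: 'text/plain',
    body: { size: 31, data: 'VGhpcyBpcyB0aGUgcGxhaW4gdGV4dCBtZXNzYWdlCg==' },
  },
};

console.log(readSinglePartBody(message));
```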
    

    If I send a message with a plain text part, an HTML part and an image, it will look like this when parsed:

    {
     "id": "154ed5ccaa12f3df",
     "threadId": "154ed5ccaa12f3df",
     "labelIds": [
      "SENT",
      "INBOX",
      "IMPORTANT"
     ],
     "snippet": "This is a plain/html message with an image.",
     "historyId": "841379",
     "internalDate": "1464271162000",
     "payload": {
      "mimeType": "multipart/mixed",
      "filename": "",
      "headers": [
         ...
      ],
      "body": {
       "size": 0
      },
      "parts": [
       {
        "mimeType": "multipart/alternative",
        "filename": "",
        "headers": [
         {
          "name": "Content-Type",
          "value": "multipart/alternative; boundary=089e0122896c7c80d80533bf3205"
         }
        ],
        "body": {
         "size": 0
        },
        "parts": [
         {
          "partId": "0.0",
          "mimeType": "text/plain",
          "filename": "",
          "headers": [
           {
            "name": "Content-Type",
            "value": "text/plain; charset=UTF-8"
           }
          ],
          "body": {
           "size": 47,
           "data": "VGhpcyBpcyBhIHBsYWluL2h0bWwgKm1lc3NhZ2UqIHdpdGggYW4gaW1hZ2UuDQo="
          }
         },
         {
          "partId": "0.1",
          "mimeType": "text/html",
          "filename": "",
          "headers": [
           {
            "name": "Content-Type",
            "value": "text/html; charset=UTF-8"
           }
          ],
          "body": {
           "size": 73,
           "data": "PGRpdiBkaXI9Imx0ciI-VGhpcyBpcyBhIHBsYWluL2h0bWwgPGI-bWVzc2FnZTwvYj4gd2l0aCBhbiBpbWFnZS48L2Rpdj4NCg=="
          }
         }
        ]
       },
       {
        "partId": "1",
        "mimeType": "image/png",
        "filename": "smile.png",
        "headers": [
           ...
        ],
        "body": {
         "attachmentId": "ANGjdJ-OrSy7VAYL-UbRyNtmySbZLlV-fV43zJF0_neNGZ8yKugsZAxb32eSb-CrbYIhF9NvjGwBVEjSkRrUWoCS7aDpgoQnt9WR7f2sa17qVEyOg_JVSbrGrunirvQw2dY-SxxB3Y0JP3aYDHSBXpNO6fFCByVFWQDw1et5Mh9di7bGO4AWOLKFVe_Yb2RmdDwuazGXGb8zA88TTMaiEPIacPTNiVtBrIWG0EKGxHBhep9j8ujyWeCS5P9X80dBHvBNj4T9XjUwcrN6FvwegRewRMM9cBupY7jQESR7915OcbhCNyi5l64x6vVh1ZU",
         "size": 2002
        }
       }
      ]
     },
     "sizeEstimate": 3077
    }
    

    As you can see, it's just the RFC 822 message parsed to JSON. If you traverse the parts, treating the payload as a part itself, you will find what you are looking for.

    var parts = [response.payload];
    
    while (parts.length) {
      var part = parts.shift();
      if (part.parts) {
        parts = parts.concat(part.parts);
      }
    
      if(part.mimeType === 'text/html') {
        var decodedPart = decodeURIComponent(escape(atob(part.body.data.replace(/\-/g, '+').replace(/\_/g, '/'))));
        console.log(decodedPart);
      }
    }
    
  • 2020-12-01 10:28

    There are many MIME types that can be returned; here are a few:

    • text/plain: the message body in plain text only
    • text/html: the message body in HTML only
    • multipart/alternative: contains two or more parts that are alternative representations of the same content, for example:
      • a text/plain part for the message body in plain text
      • a text/html part for the message body in HTML
    • multipart/mixed: contains several independent parts, which can be:
      • multipart/alternative as above, or text/plain or text/html as above
      • application/octet-stream, or another application/* type, for attachments in application-specific formats
      • image/png or another image/* type for images, which could be embedded in the message

    The definitive reference for all this is RFC 2046 https://www.ietf.org/rfc/rfc2046.txt (you might also want to see RFC 2045; the full MIME specification spans RFC 2045 through RFC 2049)

    To answer your question, build a tree of the message and look for either:

    • the first text/plain or text/html part (either as the message body itself or directly inside a multipart/mixed)
    • the first text/plain or text/html part inside a multipart/alternative, which may itself be part of a multipart/mixed
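
The lookup described above could be sketched in Node.js roughly like this. The function names findPart and findBody, the sample payload, and the choice to prefer HTML over plain text are all assumptions made for illustration; parts that carry a filename are treated as attachments and skipped.

```javascript
// Depth-first search for the first inline part of the wanted MIME type.
function findPart(part, wantedType) {
  const isInlineBody = !part.filename && part.body && part.body.data;
  if (part.mimeType === wantedType && isInlineBody) return part;
  for (const child of part.parts || []) {
    const found = findPart(child, wantedType);
    if (found) return found;
  }
  return null;
}

// Prefer the HTML body and fall back to plain text (an assumed preference).
function findBody(payload) {
  return findPart(payload, 'text/html') || findPart(payload, 'text/plain');
}

// Made-up payload: multipart/mixed -> multipart/alternative + a text attachment.
const payload = {
  mimeType: 'multipart/mixed',
  body: { size: 0 },
  parts: [
    {
      mimeType: 'multipart/alternative',
      body: { size: 0 },
      parts: [
        { mimeType: 'text/plain', filename: '', body: { size: 5, data: 'aGVsbG8' } },
        { mimeType: 'text/html', filename: '', body: { size: 12, data: 'PGI-aGVsbG88L2I-' } },
      ],
    },
    { mimeType: 'text/plain', filename: 'notes.txt', body: { size: 9, attachmentId: 'hypothetical-id' } },
  ],
};

console.log(findBody(payload).mimeType); // text/html
```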

    An example of a complex message:

    • multipart/mixed
      • multipart/alternative
        • text/plain <- message body in plain text
        • text/html <- message body in HTML
      • application/zip <- a zip file attachment
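
To see this tree for a real message, one option is to print an outline of the payload. The sketch below is illustrative only: outline is a hypothetical helper, and the payload is a hand-written stand-in for the structure listed above.

```javascript
// Returns one line per part, indented by nesting depth, noting where the data lives.
function outline(part, depth = 0) {
  let note = '';
  if (part.body && part.body.attachmentId) note = ' <- attachment (fetch by attachmentId)';
  else if (part.body && part.body.data) note = ' <- inline data';
  const lines = ['  '.repeat(depth) + part.mimeType + note];
  for (const child of part.parts || []) {
    lines.push(...outline(child, depth + 1));
  }
  return lines;
}

// Hand-written stand-in for the complex message described above.
const payload = {
  mimeType: 'multipart/mixed',
  body: { size: 0 },
  parts: [
    {
      mimeType: 'multipart/alternative',
      body: { size: 0 },
      parts: [
        { mimeType: 'text/plain', body: { size: 5, data: 'aGVsbG8' } },
        { mimeType: 'text/html', body: { size: 12, data: 'PGI-aGVsbG88L2I-' } },
      ],
    },
    { mimeType: 'application/zip', filename: 'files.zip', body: { size: 1024, attachmentId: 'hypothetical-attachment-id' } },
  ],
};

console.log(outline(payload).join('\n'));
```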
  • 2020-12-01 10:36

    Building on Tholle's idea, I've extended his script to extract the Gmail message body and attachments.

    First, fetch a Gmail message object; then parse it. You can fetch a message with this code:

    const {google} = require('googleapis')
    // do your authentication here
    const oAuth2Client = new google.auth.OAuth2(client_id, client_secret, redirectTo)
    const gmail = google.gmail({ version: 'v1', auth: oAuth2Client })
    
    // `await` must run inside an async function (or an ES module)
    const response = await gmail.users.messages.get({
      auth: oAuth2Client,
      userId: 'me',
      id: messageId,
      format: 'full'
    })
    
    const message_obj = response.data
    

    Main Script:

    function parser(response) {
    
      function decode(input) {
        // Node's 'base64' decoder also accepts the URL-safe alphabet used by the Gmail API;
        // decoding as UTF-8 (not ASCII) keeps non-ASCII characters intact
        return Buffer.from(input, 'base64').toString('utf8')
      }
    
      function decode_alternative(input) {
        // Alternative using a third-party base64 library; `base64` is assumed to be in scope
        // (the original does not show where it comes from)
        return base64.decode(input.replace(/-/g, '+').replace(/_/g, '/'))
      }
    
      const result = {
       text: '',
       html: '',
       attachments: []
      }
    
      let parts = [response.payload]
    
      while (parts.length) {
        let part = parts.shift()
    
        if (part.parts)
          parts = parts.concat(part.parts)
    
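        // body.data is missing when a text part is delivered as an attachment, so check it first
        if (part.mimeType === 'text/plain' && part.body.data)
          result.text = decode(part.body.data)
    
        if (part.mimeType === 'text/html' && part.body.data)
          result.html = decode(part.body.data)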
    
    
        if (part.body.attachmentId) {
          result.attachments.push({
            'partId': part.partId,
            'mimeType': part.mimeType,
            'filename': part.filename,
            'body': part.body
          })
        }
      }
    
      return result
    }
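
The parser above only records each attachment's attachmentId; the bytes have to be requested separately. A hedged sketch of that step, assuming `gmail` is the googleapis client created earlier and `attachment` is one entry of `result.attachments`; fetchAttachment is a hypothetical helper name, and the 'unknown' fallback filename is an arbitrary choice:

```javascript
// Hypothetical helper: downloads one attachment and returns its decoded bytes.
async function fetchAttachment(gmail, messageId, attachment) {
  const res = await gmail.users.messages.attachments.get({
    userId: 'me',
    messageId: messageId,
    id: attachment.body.attachmentId,
  });
  // The attachment data is base64url-encoded, like inline body data.
  const b64 = res.data.data.replace(/-/g, '+').replace(/_/g, '/');
  return {
    filename: attachment.filename || 'unknown',
    mimeType: attachment.mimeType,
    content: Buffer.from(b64, 'base64'), // raw bytes, ready to write to a file
  };
}
```

Inside an async function, `const file = await fetchAttachment(gmail, messageId, result.attachments[0])` would then give a Buffer in `file.content` that can be written to disk.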
    

    Sample Data and response:

    const with_multi_type_attachments = {
      "id": "16c624e85dfd9883",
      "threadId": "16c62397458f34b1",
      "labelIds": [],
      "snippet": "This is body. Inline-attachments my-custom-link my-custom-email-address Emoji:                                                                     
  • 2020-12-01 10:41

    I solved this with a recursive function; it collects all the text of the message regardless of how deeply the parts are nested in the JSON response. If you need more explanation, let me know.

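        // decodificarBase64 is the author's own base64url decoding helper (not shown)
        private List<string> ObtenerTextoMensaje(IList<MessagePart> partes)
        {
            var listaTextos = new List<string>();
            foreach (var elementoParte in partes)
            {
                if ((elementoParte.MimeType == "text/plain") || (elementoParte.MimeType == "text/html"))
                {
                    // Leaf text part: decode it only if it actually carries inline data
                    if (elementoParte.Body != null && !string.IsNullOrEmpty(elementoParte.Body.Data))
                    {
                        listaTextos.Add(decodificarBase64(elementoParte.Body.Data));
                    }
                }
                else if (elementoParte.Parts != null)
                {
                    // Container part: append the nested texts instead of overwriting what was already collected
                    listaTextos.AddRange(ObtenerTextoMensaje(elementoParte.Parts));
                }
            }
            return listaTextos;
        }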
    