Extract email message itself from all its prior messages and meta data (Sendgrid Parse API/PHP)?

风格不统一 提交于 2019-12-04 05:39:32
Swift

You have a couple of options:

1) Insert a token that splits the emails

You could do something like --- reply above this line --- and then cut out everything below that token.

2) Use an email reply parsing library

There is a really good one done by github, but it's in ruby. There's a php port though that might be good for what you need:

Fully working code:

<?php
  require_once 'application/third_party/EmailReplyParser-master/src/autoload.php';
  $email = new \EmailReplyParser\Email();
  $reply = $email->read($_POST['text']);            
  $message=$reply[0]->getContent();
  $message=preg_replace('~On(.*?)wrote:(.*?)$~si', '', $message); 
  //Last line is needed for some email clients, e.g., some university e-mails: foo@bar.edu but not Gmail or Hotmail, to get rid of "On Jan 23...wrote:" 
  //This failure to remove "On Jan 23...wrote:" is a known issue and is documented in their README

 ?>

There's simply no guaranteed way to parse quoted message threads from an email message, so you won't find a regex or any other code that will work in all cases. There's no standard to define formatting of replies, and as you've already observed different mail clients use different conventions. Many, in fact, will allow the user to edit the quoted text. Also, users can paste in unrelated messages, with or without headers, resulting in a mix-and-match of formats.

If you can record and keep the history of all messages as they are sent and received, then you can (usually, but not always) use the In-Reply-To header (see RFC-5322) to locate the previous message by matching it's Message-ID header, and do a diff on the body and remove duplicate text runs. It's apparent that some email systems do this to improve their presentations, but I'm not aware of any available open source code.

// cut quoted text, https://regex101.com/r/xO8nI1/5

    $message = preg_replace('/(On\s.*<\n){0,1}(.*\n(\n){0,1}((^>+\s?.*$)+\n?)+)/mi', '', $message);

How about replies in languages other than English? We came out with solution to add marker, but instead of translating it for every email (based on user's language) we put some invisible characters into it (zero width space U+200B , to be precise). Basing on "On..." regexp it's error prone, it can easily cut some email content.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!