Algorithm to match natural text in mail

Question

I need to separate natural, coherent text/sentences in emails from lists, signatures, greetings and so on before further processing.

Example:

Hi tom,

last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.

list item 2

list item 3

list item 3

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit

regards, K.

---line-of-funny-characters-#######

example inc.

33 evil street, london

mobile: 00 234534/234345

Ideally the algorithm would match only the bold parts.

Is there a recommended approach, or are there even existing algorithms for this problem? Should I try approximate regular expressions, or something more statistical based on the number of punctuation marks, line length and so on?

Answer 1:

You will need serious NLP techniques to do this well; how much depends on the precision you expect and on how random and vague the input emails are.

Read this one. See the references section for other relevant stuff.

This deals with a different classification problem, but it also operates on email text.

Answer 2:

In the example you post, line length suffices.

There is no perfect algorithm; even human beings will classify lines differently.

I suggest just using line length until you find a counterexample, at which point you revise your algorithm. Repeat until the problem is solved to your satisfaction.
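A minimal sketch of the line-length heuristic in Python. The function name prose_lines and the threshold of 80 characters are assumptions for illustration, not part of the answer, and the sketch assumes each paragraph arrives as a single unwrapped line, as in the example mail.

```python
def prose_lines(text, min_length=80):
    """Return the lines whose stripped length is at least min_length characters."""
    return [line for line in text.splitlines() if len(line.strip()) >= min_length]


# Shortened version of the example mail from the question.
email_body = """Hi tom,

last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.

list item 2

regards, K.

mobile: 00 234534/234345"""

for line in prose_lines(email_body):
    print(line)
```

If a counterexample shows up (for instance a long list item or a hard-wrapped paragraph), the threshold or the rule itself is the place to revise.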

Answer 3:

You'll need many heuristics to get an approximation of a solution, so here's one: you can safely cut off anything after a sigdash (hyphen-hyphen-space), which standards-conforming e-mail messages use to separate the message body from the signature.
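As a rough sketch of the sigdash rule in Python: the function name strip_signature is hypothetical, and the code assumes the delimiter sits on a line of its own, exactly as two hyphens followed by one space.

```python
def strip_signature(body):
    """Drop everything from the first signature delimiter line onward."""
    kept = []
    for line in body.splitlines():
        # The delimiter is exactly two hyphens followed by one space.
        if line == "-- ":
            break
        kept.append(line)
    return "\n".join(kept)


message = "Hi tom,\n\nlast monday we did bla bla.\n-- \nK.\nexample inc."
print(strip_signature(message))
```

Some mail clients drop the trailing space, so in practice you may also want to accept a bare "--" line; whether that is safe depends on your data.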

Another approach you can use is to store copies of e-mails from the same sender; this should allow you to extract things that are the same or similar in every message (such as salutations and signatures) and detect how their mail client does quoting.
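One way this per-sender comparison could look in Python. The names recurring_lines and remove_recurring and the min_share value of 0.8 are assumptions for illustration; the sketch only catches lines that are identical across messages, so "similar" lines would need a fuzzy comparison on top.

```python
from collections import Counter


def recurring_lines(messages, min_share=0.8):
    """Return stripped lines that occur in at least min_share of the messages."""
    counts = Counter()
    for message in messages:
        # Count each distinct non-empty line once per message.
        counts.update({line.strip() for line in message.splitlines() if line.strip()})
    threshold = min_share * len(messages)
    return {line for line, n in counts.items() if n >= threshold}


def remove_recurring(message, recurring):
    """Drop every line of message whose stripped form is in recurring."""
    return "\n".join(
        line for line in message.splitlines() if line.strip() not in recurring
    )


# Hypothetical stored mails from one sender.
stored = [
    "Hi tom,\nfirst report text\nregards, K.",
    "Hi tom,\nsecond report text\nregards, K.",
    "Hi tom,\nthird report text\nregards, K.",
]
boilerplate = recurring_lines(stored)
print(remove_recurring(stored[0], boilerplate))
```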

Answer 4:

If your only task is to fish out the bold parts, look at how the bold text is technically implemented in your mail database. For example, if it's HTML, you could have something like this:

Hi tom,

<b>last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.</b>
list item 2
list item 3
list item 3
<b>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit</b>

regards, K.

---line-of-funny-characters-#######

Then you can run the following code:

import re
# Save the mail above in the variable MailAbove.
# The non-greedy group captures the text between each <b>...</b> pair.
print(re.findall(r'<b>(.*?)</b>', MailAbove))

Result:

['last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.', 'Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit']

Edit: It follows from the comments that I misunderstood the question. In general, tasks like this are a multi-stage process: you apply some methods, look at what the result misses or includes by mistake, then fix or add methods and check the outcome again.
I recommend reading this, an excellent tutorial/book on solving tasks like yours and beyond.

Source: https://stackoverflow.com/questions/10046451/algorithm-to-match-natural-text-in-mail

Tags

python

regex

algorithm

nlp