I found this Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling? explaining how to feed a text into texttiling, however I am unable to actually return a tex
What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?
The texttiling algorithm {1,4,5} isn't designed to perform sequential text classification {2,3} (which is the task you described). Instead, from http://people.ischool.berkeley.edu/~hearst/research/tiling.html:
TextTiling is [an unsupervised] technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.
References:
What about using splitlines
? Or do you have to use the nltk package?
email = """ From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""
y = [s.strip() for s in email.splitlines()]
print(y)