How to split text into paragraphs using NLTK nltk.tokenize.texttiling?

前端未结

关注

 2  1403

I found this Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling? explaining how to feed a text into texttiling, however I am unable to actually return a tex

相关标签:

2条回答

无人共我

2021-01-15 11:48
What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?

The texttiling algorithm {1,4,5} isn't designed to perform sequential text classification {2,3} (which is the task you described). Instead, from http://people.ischool.berkeley.edu/~hearst/research/tiling.html:

TextTiling is [an unsupervised] technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.

References:
- {1} Marti A. Hearst, Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Los Cruces, NM, June, 1994. pdf
- {2} Lee, J.Y. and Dernoncourt, F., 2016, June. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 515-520). https://www.aclweb.org/anthology/N16-1062.pdf
- {3} Dernoncourt, Franck, Ji Young Lee, and Peter Szolovits. "Neural Networks for Joint Sentence Classification in Medical Paper Abstracts." In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 694-700. 2017. https://www.aclweb.org/anthology/E17-2110.pdf
- {4} Hearst, M. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages, Computational Linguistics, 23 (1), pp. 33-64, March 1997. pdf
- {5} Pevzner, L., and Hearst, M., A Critique and Improvement of an Evaluation Metric for Text Segmentation, Computational Linguistics, 28 (1), March 2002, pp. 19-36. pdf
0 讨论(0)
发布评论:

提交评论
- 加载中...

渐次进展

2021-01-15 11:51

What about using splitlines? Or do you have to use the nltk package?

email = """    From: X
    To: Y                             (LOGISTICS)
    Date: 10/03/2017

    Hello team,                       (INTRO)

    Some text here representing
    the body                          (BODY)
    of the text.

    Regards,                          (OUTRO)
    X

    *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
    THIS EMAIL IS CONFIDENTIAL
    IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

y = [s.strip() for s in email.splitlines()]

print(y)

0 讨论(0)