How to split text into paragraphs using NLTK nltk.tokenize.texttiling?

爱一瞬间的悲伤 2021-01-15 11:05

I found the question "Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?", which explains how to feed a text into texttiling; however, I am unable to actually return a tex

2 Answers
  • 2021-01-15 11:48

    What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?

    The TextTiling algorithm {1,4,5} isn't designed for sequential text classification {2,3}, which is the task you describe. Instead, according to http://people.ischool.berkeley.edu/~hearst/research/tiling.html:

    TextTiling is [an unsupervised] technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.


    References:

    • {1} Marti A. Hearst, Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.
    • {2} Lee, J.Y. and Dernoncourt, F., 2016, June. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 515-520). https://www.aclweb.org/anthology/N16-1062.pdf
    • {3} Dernoncourt, Franck, Ji Young Lee, and Peter Szolovits. "Neural Networks for Joint Sentence Classification in Medical Paper Abstracts." In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 694-700. 2017. https://www.aclweb.org/anthology/E17-2110.pdf
    • {4} Hearst, M. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23 (1), pp. 33-64, March 1997.
    • {5} Pevzner, L., and Hearst, M., A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28 (1), March 2002, pp. 19-36.
  • 2021-01-15 11:51

    What about using splitlines? Or do you have to use the nltk package?

    email = """    From: X
        To: Y                             (LOGISTICS)
        Date: 10/03/2017
    
        Hello team,                       (INTRO)
    
        Some text here representing
        the body                          (BODY)
        of the text.
    
        Regards,                          (OUTRO)
        X
    
        *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
        THIS EMAIL IS CONFIDENTIAL
        IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""
    
    y = [s.strip() for s in email.splitlines()]
    
    print(y)
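
Note that splitlines alone returns individual lines, not the five sections. If the sections are always separated by blank lines, as in the sample, one possible sketch is to split on blank lines instead and label the blocks by position. The `SECTION_NAMES` list and the `split_sections` helper are illustrative names, not part of NLTK, and the "(LABEL)" annotations are left out on the assumption that they are not in the real email:

```python
import re

# Same sample as above, without the "(LABEL)" annotations.
email = """From: X
To: Y
Date: 10/03/2017

Hello team,

Some text here representing
the body
of the text.

Regards,
X

*****DISCLAIMER*****
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

# Assumed section order, taken from the question.
SECTION_NAMES = ["LOGISTICS", "INTRO", "BODY", "OUTRO", "POST EMAIL DISCLAIMER"]


def split_sections(text):
    """Split text on blank lines and strip every line of each block."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return ["\n".join(line.strip() for line in block.splitlines()) for block in blocks]


# Pair each block with its assumed label, then keep only the body.
sections = dict(zip(SECTION_NAMES, split_sections(email)))
print(sections["BODY"])
```

This only works while each section is a single blank-line-delimited block; a body with several paragraphs would shift the labels, so real emails would likely need extra rules (for example, anchoring on the greeting and the sign-off).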
    