NLTK corpus reader paragraph

Submitted by 纵饮孤独 on 2019-12-12 03:04:00

Question


I copied and pasted the content of a Word document (.docx) into a .txt file and had an NLTK corpus reader read it to count the paragraphs. It returned almost 30 paragraphs as a single paragraph. When I manually entered line breaks in the .txt file, it returned 30 paragraphs.

import nltk
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt")
print("Paragraphs =", len(corpusReader.paras()))
  1. Is it possible for PlaintextCorpusReader to read .docx files?
  2. When copying and pasting from .docx to .txt, how can I preserve line breaks?
  3. Is there a way in Python to open the .txt file, find ?, !, ., or ... followed by some blank spaces (4 in number), and insert a line break there automatically?

Edit 1.

I tried the para_block_reader=read_line_block approach, but it always gives a paragraph count that is one too high.

import nltk
from nltk.corpus.reader.util import read_line_block
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt", para_block_reader=read_line_block)
print("Paragraphs =", len(corpusReader.paras()))

Answer 1:


The plaintext corpus reader can only read plain-text files. There are Python libraries that can read docx, but that will not address your problem: Word delimits paragraphs by a single line break, whereas plaintext documents traditionally understand a paragraph boundary to be a blank line, i.e., two successive newlines. In other words, your export method does preserve the newlines; there just aren't enough of them.
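
To see the effect, the blank-line convention can be approximated in a few lines. This is a simplified sketch, not NLTK's actual read_blankline_block; the helper name split_on_blank_lines and the sample text are made up for illustration.

```python
import re

def split_on_blank_lines(text):
    """Approximate blank-line paragraph splitting (hypothetical helper, not NLTK code)."""
    # A paragraph boundary is a newline, optional whitespace, and another newline.
    blocks = re.split(r"\n\s*\n", text)
    return [b.strip() for b in blocks if b.strip()]

# Text as Word exports it: one newline per paragraph.
word_export = "First paragraph.\nSecond paragraph.\nThird paragraph."
# The same text with every newline doubled.
fixed = word_export.replace("\n", "\n\n")

print(len(split_on_blank_lines(word_export)))  # single newlines: one block
print(len(split_on_blank_lines(fixed)))        # blank lines: three blocks
```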

So there is an easy way to fix up your texts so that paragraphs are recognized without extra to-do: Once you've written out your plaintext file (which you can do from Word's Save As... menu or by cutting and pasting), post-process it like this (add encoding= arguments as necessary):

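import re

# Read the exported text; each Word paragraph ends in a single newline.
with open("my_plaintext.txt") as oldfile:
    content = oldfile.read()

# Double every newline so each paragraph is followed by a blank line.
content = re.sub("\n", "\n\n", content)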

with open("my_plaintext_fixed.txt", "w") as newfile:
    newfile.write(content)

You can now read my_plaintext_fixed.txt with the PlaintextCorpusReader, and everything will work as expected.
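
One caveat: if the exported file already contains some blank lines, doubling every newline turns them into longer runs of empty lines. A hedged variant collapses each run of line breaks into exactly one blank line. The function name normalize_paragraph_breaks is only illustrative, and the sketch assumes that leading and trailing whitespace on each line can be discarded.

```python
import re

def normalize_paragraph_breaks(content):
    """Separate every line of text by exactly one blank line (illustrative helper)."""
    # Any run of whitespace that contains at least one newline becomes "\n\n".
    # This also swallows stray "\r" characters and indentation around the breaks.
    return re.sub(r"\s*\n\s*", "\n\n", content.strip())

sample = "First paragraph.\nSecond paragraph.\n\n\nThird paragraph.\n"
print(repr(normalize_paragraph_breaks(sample)))
```

It can be dropped in place of the re.sub call when the input mixes single and multiple line breaks.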




Answer 2:


The source code for PlaintextCorpusReader is fairly simple; it is the first class defined on this page.

It has sub-components. If you don't specify them in the constructor, it uses the NLTK defaults:

  • para_block_reader (default: read_blankline_block), which says how the document is broken up into paragraphs.
  • sent_tokenizer (default: English Punkt), which says how to break a paragraph into sentences.
  • word_tokenizer (default: WordPunctTokenizer()), which says how to break a sentence into tokens (words and symbols).

Note that the defaults may change between versions of NLTK. I feel like the default word_tokenizer used to be the Penn tokenizer.

Re: 1.

No, PlaintextCorpusReader cannot read .docx files. It only reads plain text. I'm sure you can find a Python library to convert them.
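
For simple documents the conversion can also be sketched with the standard library alone, since a .docx file is a ZIP archive whose main text lives in word/document.xml. The helper names below (docx_paragraphs, docx_to_plaintext) are hypothetical, and the sketch makes simplifying assumptions: it ignores headers, footnotes, and formatting, and may mishandle paragraphs nested inside text boxes. A dedicated library is likely more robust for real documents.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(path_or_file):
    """Return the text of each non-empty paragraph in a .docx file (hypothetical helper)."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs

def docx_to_plaintext(path_or_file, out_path):
    """Write the paragraphs separated by blank lines, as the default paragraph reader expects."""
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n\n".join(docx_paragraphs(path_or_file)) + "\n")
```

Calling docx_to_plaintext on the Word file (the file names are up to you) produces a text file in which every paragraph is already followed by a blank line, so no post-processing is needed.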

Re: 2.

Copy and paste is off-topic for this site; try SuperUser. I suggest, though, that you instead use option 1 and get a library to do the conversion.

Re: 3.

Yes, you can do a search and replace using Regex.

import re

def breakup(mystring):
    # After "...", ".", "?" or "!" followed by four spaces, keep the punctuation and break the line.
    return re.sub(r"(\.\.\.|[.?!]) {4}", r"\1\n", mystring)

But perhaps you might instead want to swap out your para_block_reader or sent_tokenizer.
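
As one example of swapping the para_block_reader: a block reader is a function that takes an open stream and returns a list of text blocks. The sketch below treats every non-blank line as its own paragraph, which matches how Word exports text. Skipping blank lines may also avoid counting an empty trailing line as an extra paragraph, though that is an assumption about the cause. The name read_nonblank_line_block and the batch size of 20 lines are illustrative choices.

```python
import io

def read_nonblank_line_block(stream, max_lines=20):
    """Return up to max_lines lines from stream as paragraph blocks, skipping blank lines."""
    blocks = []
    for _ in range(max_lines):
        line = stream.readline()
        if not line:          # end of file
            break
        line = line.strip()
        if line:              # ignore empty or whitespace-only lines
            blocks.append(line)
    return blocks

# Quick check on an in-memory stream instead of a real corpus file.
sample_stream = io.StringIO("First paragraph.\nSecond paragraph.\n\n")
print(read_nonblank_line_block(sample_stream))
```

With NLTK installed, it would be passed to the reader as PlaintextCorpusReader(".", "d.txt", para_block_reader=read_nonblank_line_block).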



Source: https://stackoverflow.com/questions/39971017/nltk-corpus-reader-paragraph
