Question
I copied and pasted content from a Word document (.docx) into a .txt file and had an NLTK corpus reader read it to count the paragraphs. It returns almost 30 paragraphs as a single paragraph. When I manually entered line breaks in the .txt file, it returned 30 paragraphs.
import nltk
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt")
print("Paragraphs =", len(corpusReader.paras()))
- Is it possible for PlaintextCorpusReader to read .docx files?
- When copying and pasting from .docx to .txt, how can line breaks be preserved?
- Is there a way, using Python, to open the .txt file, find ?, !, . or ... followed by some blank spaces (four of them), and insert a line break there automatically?
Edit 1.
I tried passing para_block_reader=read_line_block, but it always gives a paragraph count that is one too high.
import nltk
from nltk.corpus.reader.util import *
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt", para_block_reader=read_line_block)
print("Paragraphs =", len(corpusReader.paras()))
Answer 1:
The plaintext corpus reader can only read plain-text files. There are Python libraries that can read docx, but that will not address your problem, which is that Word delimits paragraphs by a single line break, whereas plaintext documents traditionally understand a paragraph boundary to be a blank line, i.e., two successive newlines. In other words, your export method does preserve the newlines; there just aren't enough of them.
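To make the difference concrete, here is a minimal sketch that counts paragraphs by splitting on blank lines. The helper `split_paragraphs` is a simplified stand-in written for illustration, not NLTK's actual `read_blankline_block`, and the two sample strings are made up.

```python
import re

def split_paragraphs(text):
    """Split text on blank lines (illustrative stand-in, not NLTK's code)."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [block for block in blocks if block.strip()]

# Word-style export: one newline between paragraphs.
word_style = "First paragraph.\nSecond paragraph.\nThird paragraph.\n"
# Conventional plaintext: a blank line between paragraphs.
plain_style = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.\n"

print(len(split_paragraphs(word_style)))   # 1
print(len(split_paragraphs(plain_style)))  # 3
```

With single newlines the whole text comes back as one block, which matches the behaviour described in the question.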
So there is an easy way to fix up your texts so that paragraphs are recognized without extra to-do: Once you've written out your plaintext file (which you can do from Word's Save As... menu or by cutting and pasting), post-process it like this (add encoding= arguments as necessary):
import re

with open("my_plaintext.txt") as oldfile:
    content = oldfile.read()
# Double every newline so each line break becomes a blank-line paragraph boundary.
content = re.sub("\n", "\n\n", content)
with open("my_plaintext_fixed.txt", "w") as newfile:
    newfile.write(content)
You can now read "my_plaintext_fixed.txt" with the PlaintextCorpusReader, and everything will work as expected.
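If the file might be post-processed more than once, or already contains some blank lines, a slightly more defensive variant may be useful. This is only a sketch: the helper name `normalize_paragraph_breaks` is hypothetical, and it assumes (as above) that every line in the export is one paragraph, so hard-wrapped text would be split incorrectly.

```python
import re

def normalize_paragraph_breaks(text):
    """Collapse every run of line breaks into exactly one blank line."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Any run of newlines (with optional trailing spaces or tabs on the
    # lines in between) becomes a single blank-line separator.
    body = re.sub(r"(?:[ \t]*\n)+", "\n\n", text.strip())
    return body + "\n"

sample = "First paragraph.\r\nSecond paragraph.\n \n\nThird paragraph.\n"
print(repr(normalize_paragraph_breaks(sample)))
```

Running the function on its own output leaves the text unchanged, so applying it twice does not pile up extra blank lines.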
Answer 2:
The source code for PlaintextCorpusReader is the first class defined on this page; it is fairly simple.
It has sub-components; if you don't specify them in the constructor, it uses the NLTK defaults:
- para_block_reader (default: read_blankline_block), which says how the document is broken up into paragraphs.
- sent_tokenizer (default: English Punkt), which says how to break a paragraph into sentences.
- word_tokenizer (default: WordPunctTokenizer()), which says how to break a sentence into tokens (words and symbols).
Note that the defaults may change between versions of NLTK. I feel like the default word_tokenizer used to be the Penn tokenizer.
Re: 1.
No, PlaintextCorpusReader cannot read .docx files. It only reads plain text. I'm sure you can find a Python library to convert it.
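As one possible route, a .docx file is a zip archive whose main text lives in word/document.xml, so a rough conversion can be sketched with the standard library alone. The function names below are hypothetical, and the sketch ignores tables, headers, footnotes and other structure; a dedicated library such as python-docx would likely be more robust.

```python
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_paragraphs(path_or_file):
    """Return the text of each non-empty paragraph in a .docx file."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter("{%s}t" % W_NS))
        if text.strip():
            paragraphs.append(text)
    return paragraphs

def docx_to_plaintext(path_or_file):
    """Join paragraphs with blank lines, as a blank-line paragraph reader expects."""
    return "\n\n".join(docx_paragraphs(path_or_file)) + "\n"
```

Writing the result of `docx_to_plaintext("d.docx")` to a .txt file (the filename is just an example) should give a file in which every Word paragraph is separated by a blank line.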
Re: 2.
Copy and paste is off-topic for this site; try Super User. I suggest, though, that you instead go with option 1 and get a library to do the conversion.
Re: 3.
Yes, you can do a search and replace using a regex.
import re

def breakup(mystring):
    # re.sub takes (pattern, replacement, string); \1 keeps the matched punctuation.
    return re.sub(r"(\.\.\.|\.|!|\?) ", r"\1\n", mystring)
But perhaps you might instead want to swap out your para_block_reader or sent_tokenizer.
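For the exact pattern described in the question (sentence-ending punctuation followed by four spaces), a sketch could look like the following. The function name `break_on_wide_gaps` is hypothetical, and it assumes that a run of four or more spaces after ., !, ? or ... always marks a paragraph boundary; it inserts a blank line so that a blank-line paragraph reader would see the boundary.

```python
import re

# Sentence-ending punctuation followed by at least four spaces.
WIDE_GAP = re.compile(r"(\.\.\.|[.?!]) {4,}")

def break_on_wide_gaps(text):
    """Replace each wide gap after punctuation with a blank line."""
    return WIDE_GAP.sub(r"\1\n\n", text)

sample = "Hello world.    Next one!    Wait...    The end? Still here."
print(break_on_wide_gaps(sample))
```

Single spaces after punctuation, as in "The end? Still here.", are left alone, so ordinary sentence boundaries inside a paragraph are not split.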
Source: https://stackoverflow.com/questions/39971017/nltk-corpus-reader-paragraph