I have written the following code to tokenize the input paragraph that comes from the file samp.txt. Can anybody help me out to find and print the number of sentences, words
Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath
):
import nltk
folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, '.*\.txt')
print "The number of sentences =", len(corpusReader.sents())
print "The number of patagraphs =", len(corpusReader.paras())
print "The number of words =", len([word for sentence in corpusReader.sents() for word in sentence])
print "The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word])
Hope this helps
For words and sentences you will probably need to clearly state your definition of a sentence and a word and program for that.
The only way you can solve this is by creating an AI program that uses Natural Language Processing which is not very easy to do.
Input:
"This is a paragraph about the Turing machine. Dr. Allan Turing invented the Turing Machine. It solved a problem that has a .1% change of being solved."
Checkout OpenNLP
https://sourceforge.net/projects/opennlp/
http://opennlp.apache.org/
There's already a program to count words and characters-- wc
.
Not 100% correct but I just gave a try. I have not taken all points by @wilhelmtell in to consideration. I try them once I have time...
if __name__ == "__main__":
f = open("1.txt")
c=w=0
s=1
prevIsSentence = False
for x in f:
x = x.strip()
if x != "":
words = x.split()
w = w+len(words)
c = c + sum([len(word) for word in words])
prevIsSentence = True
else:
if prevIsSentence:
s = s+1
prevIsSentence = False
if not prevIsSentence:
s = s-1
print "%d:%d:%d" % (c,w,s)
Here 1.txt is the file name.
For what it's worth if someone comes along here. This addresses all that the OP's question asked I think. If one uses the textstat
package, counting sentences and characters is very easy. There is a certain importance for punctuation at the end of each sentence.
import textstat
your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
print("Num sentences:", textstat.sentence_count(your_text))
print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
print("Num words:", len(your_text.split()))