How do I count the number of sentences, words and characters in a file?

清歌不尽 2020-12-10 06:26

I have written the following code to tokenize the input paragraph that comes from the file samp.txt. Can anybody help me out to find and print the number of sentences, words

7 answers
  • 2020-12-10 06:28

    Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):

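    import nltk

    # dirpath must be set beforehand to the directory that holds the text file.
    folder = nltk.data.find(dirpath)
    # Match every .txt file in that directory (raw string keeps the regex escape intact).
    corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')

    # sents() and paras() rely on NLTK's sentence tokenizer data being installed.
    print("The number of sentences =", len(corpusReader.sents()))
    print("The number of paragraphs =", len(corpusReader.paras()))
    print("The number of words =", len([word for sentence in corpusReader.sents() for word in sentence]))
    print("The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word]))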
    

    Hope this helps

  • 2020-12-10 06:28
    • Characters are easy to count.
    • Paragraphs are usually easy to count too. Whenever you see two consecutive newlines you probably have a paragraph break. There are edge cases, though. You might say that an enumeration or an unordered list is a single paragraph, even though its entries can each be delimited by two newlines. A heading or a title can also be followed by two newlines, even though it is clearly not a paragraph. Also consider the case of a single paragraph in a file, with one or no newlines following.
    • Sentences are tricky. You might settle for a period, exclamation mark or question mark followed by whitespace or end-of-file. It's tricky because sometimes a colon marks the end of a sentence and sometimes it doesn't. Usually when it does, the next non-whitespace character is a capital letter, in the case of English. But sometimes not; for example, if it's a digit. And sometimes an open parenthesis marks the end of a sentence (but that is arguable, as in this case).
    • Words are tricky too. Usually words are delimited by whitespace or punctuation marks. Sometimes a dash delimits a word, sometimes not. That is the case with a hyphen, for example.

    For words and sentences you will probably need to clearly state your definition of a sentence and a word and program for that.
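    As a rough illustration of the definitions above, here is a minimal sketch using only the standard library. The function name count_text and the exact regular expressions are my own assumptions, not an established method: sentences are taken to end at '.', '!' or '?' followed by whitespace or end of text, paragraphs are blocks separated by blank lines, and hyphenated words count as one word. It ignores the colon, parenthesis and list edge cases described above.

```python
import re


def count_text(text: str) -> dict:
    """Count characters, paragraphs, sentences and words with simple heuristics."""
    # Characters: every character in the text, whitespace included.
    characters = len(text)
    # Paragraphs: non-empty blocks separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: a run of '.', '!' or '?' followed by whitespace or end of text.
    sentences = re.findall(r"[.!?]+(?=\s|$)", text)
    # Words: runs of word characters, with inner hyphens or apostrophes kept together.
    words = re.findall(r"\w+(?:[-']\w+)*", text)
    return {
        "characters": characters,
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "words": len(words),
    }


sample = "Hello world! This is a well-known test.\n\nIs it? Yes."
print(count_text(sample))
```

    To run it on a file, read the file into a string first and pass that string to count_text. Changing the definitions only means changing the corresponding regular expression.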

  • 2020-12-10 06:40

    The only way to solve this reliably is to write a program that uses Natural Language Processing, which is not very easy to do.

    Input:

    "This is a paragraph about the Turing machine. Dr. Alan Turing invented the Turing Machine. It solved a problem that has a .1% chance of being solved."

    Check out OpenNLP

    https://sourceforge.net/projects/opennlp/

    http://opennlp.apache.org/
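
    To show why an input like the one above defeats simple rules, here is a small sketch (the helper naive_sentence_count is a hypothetical name of mine, and the splitting rule is an assumption). The text has three sentences, but counting periods, or splitting after every period that is followed by whitespace, both over-count because of "Dr." and ".1%".

```python
import re

text = (
    "This is a paragraph about the Turing machine. "
    "Dr. Alan Turing invented the Turing Machine. "
    "It solved a problem that has a .1% chance of being solved."
)


def naive_sentence_count(s: str) -> int:
    # Split after '.', '!' or '?' whenever whitespace follows.
    parts = re.split(r"(?<=[.!?])\s+", s.strip())
    return len([p for p in parts if p])


# Counting raw periods also picks up "Dr." and the decimal point in ".1%".
period_count = text.count(".")
# The whitespace rule skips ".1%" but still splits after the abbreviation "Dr.".
naive_count = naive_sentence_count(text)

print("periods:", period_count)
print("naive sentence count:", naive_count)
```

    Handling abbreviations like "Dr." is the kind of problem a trained sentence detector is meant to address.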

  • 2020-12-10 06:41

    There's already a program that counts words and characters: wc.
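
    If you want the same numbers from inside a Python script, here is a minimal sketch of what wc reports by default: newline count, whitespace-delimited word count and byte count. The function name wc_counts is my own, and this is an approximation of wc's behaviour, not a reimplementation of it. Note that wc does not count sentences.

```python
from typing import Tuple


def wc_counts(data: bytes) -> Tuple[int, int, int]:
    """Return (lines, words, bytes), roughly what `wc` prints by default."""
    # Lines: number of newline characters, so a last line without '\n' is not counted.
    lines = data.count(b"\n")
    # Words: maximal runs of non-whitespace bytes.
    words = len(data.split())
    # Bytes: raw length of the data.
    return lines, words, len(data)


print(wc_counts(b"hello world\nfoo\n"))
```

    For a real file, open it in binary mode, read its contents and pass them in, for example the contents of samp.txt from the question.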

  • 2020-12-10 06:45

    This is not 100% correct, but I gave it a try. I have not taken all of @wilhelmtell's points into consideration. I will try them once I have time...

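    if __name__ == "__main__":
        # c = non-whitespace characters, w = words,
        # s = blocks of text separated by blank lines (a rough stand-in for sentences)
        c = w = 0
        s = 1
        prevIsSentence = False
        with open("1.txt") as f:
            for x in f:
                x = x.strip()
                if x != "":
                    words = x.split()
                    w = w + len(words)
                    c = c + sum(len(word) for word in words)
                    prevIsSentence = True
                else:
                    # A blank line after text closes the current block.
                    if prevIsSentence:
                        s = s + 1
                    prevIsSentence = False

        # Undo the extra count when the file is empty or ends with blank lines.
        if not prevIsSentence:
            s = s - 1
        print("%d:%d:%d" % (c, w, s))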
    

    Here 1.txt is the file name.

  • 2020-12-10 06:46

    For what it's worth, in case someone comes along here: I think this addresses everything the OP asked. With the textstat package, counting sentences and characters is very easy. Note that the sentence count depends on punctuation at the end of each sentence.

    import textstat
    
    your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
    print("Num sentences:", textstat.sentence_count(your_text))
    print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
    print("Num words:", len(your_text.split()))
    