Count most commonly used words in a txt file

Question

I'm trying to get a list of the 10 most commonly used words in a txt file with the end goal of building a word cloud. The following code does not produce anything when I print.

>>> import collections
>>> from collections import Counter
>>> file = open('/Users/Desktop/word_cloud/98-0.txt')
>>> wordcount={}
>>> d = collections.Counter(wordcount)
>>> for word, count in d.most_common(10):
    print(word, ": ", count)

Answer 1:

Actually, I would recommend that you continue to use Counter. It's a really useful tool for, well, counting things, and it has really expressive syntax, so you don't need to worry about sorting anything. Using it, you can do:

from collections import Counter

# Open the file. The with statement here will automatically close it afterwards.
with open("input.txt") as input_file:
    # Build a counter from each word in the file
    count = Counter(word for line in input_file
                         for word in line.split())

print(count.most_common(10))

With my input.txt, this has the output of

[('THE', 27643), ('AND', 26728), ('I', 20681), ('TO', 19198), ('OF', 18173), ('A', 14613), ('YOU', 13649), ('MY', 12480), ('THAT', 11121), ('IN', 10967)]

I've changed it a bit so it doesn't have to read the whole file into memory. My input.txt is my punctuationless version of the works of Shakespeare, to demonstrate that this code is fast. It takes about 0.2 seconds on my machine.

Your code was a bit haphazard: it looks like you've tried to bring together several approaches, keeping bits of each here and there. My code has been annotated with some explanatory comments. Hopefully it should be relatively straightforward, but if you're still confused about anything, let me know.
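To tie this back to the loop in the question, here is a minimal sketch that wraps the same Counter expression in a function and prints each word with its count. The name `top_words` and the sample text are illustrative assumptions, not part of the original answer; `io.StringIO` stands in for a real file so the snippet runs on its own.

```python
import io
from collections import Counter

def top_words(lines, n=10):
    """Return the n most common whitespace-separated words in an iterable of lines."""
    counts = Counter(word for line in lines for word in line.split())
    return counts.most_common(n)

# StringIO behaves like an open text file: iterating over it yields one line at a time.
sample = io.StringIO("the cat and the dog\nthe end\n")

for word, count in top_words(sample, 3):
    print(word, ": ", count)
```

Because `top_words` accepts any iterable of lines, passing an open file object should work the same way, still reading one line at a time.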

Answer 2:

You haven't pulled anything from the .txt file yet. What does the inside of the text file look like? If you want to classify words as groups of characters separated by spaces, you could get a list of the words with:

with open('path/to/file.txt', 'r') as f:
    # split() with no argument splits on any whitespace (spaces, tabs, newlines)
    words = f.read().split()

Then, to get the 10 most common (there are probably more efficient ways, but this is what I found first):

word_counter = {}
for word in words:
    if word in word_counter:
        word_counter[word] += 1
    else:
        word_counter[word] = 1

# Sort the words by their counts, most frequent first
popular_words = sorted(word_counter, key=word_counter.get, reverse=True)

print(popular_words[:10])
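Note that `popular_words` holds only the words, not how often each one occurred. If the counts are needed too (for a word cloud, say), one option is to sort the dictionary's items instead. A minimal sketch, with a hard-coded `words` list standing in for the file contents and `popular_with_counts` as an illustrative name:

```python
# Stand-in for the words read from the file
words = "to be or not to be that is the question to".split()

word_counter = {}
for word in words:
    # dict.get returns 0 the first time a word is seen
    word_counter[word] = word_counter.get(word, 0) + 1

# Sort (word, count) pairs by count, highest first
popular_with_counts = sorted(word_counter.items(),
                             key=lambda item: item[1],
                             reverse=True)

print(popular_with_counts[:2])
```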

Source: https://stackoverflow.com/questions/45822827/count-most-commonly-used-words-in-a-txt-file

Tags

python

count