Count most commonly used words in a txt file

拥有回忆 submitted on 2020-05-30 08:02:38

Question


I'm trying to get a list of the 10 most commonly used words in a txt file with the end goal of building a word cloud. The following code does not produce anything when I print.

>>> import collections
>>> from collections import Counter
>>> file = open('/Users/Desktop/word_cloud/98-0.txt')
>>> wordcount={}
>>> d = collections.Counter(wordcount)
>>> for word, count in d.most_common(10):
    print(word, ": ", count)

Answer 1:


Actually, I would recommend that you continue to use Counter. It's a really useful tool for, well, counting things, and it has really expressive syntax, so you don't need to worry about sorting anything yourself. Using it, you can do:

from collections import Counter

# Open the file. The with statement closes it automatically afterwards.
with open("input.txt") as input_file:
    # Build a counter from each word in the file, one line at a time
    count = Counter(word for line in input_file
                         for word in line.split())

print(count.most_common(10))

With my input.txt, this has the output of

[('THE', 27643), ('AND', 26728), ('I', 20681), ('TO', 19198), ('OF', 18173), ('A', 14613), ('YOU', 13649), ('MY', 12480), ('THAT', 11121), ('IN', 10967)]

I've changed it a bit so it iterates over the file line by line and doesn't have to read the whole file into memory. My input.txt is my punctuationless version of the works of Shakespeare, to demonstrate that this code is fast. It takes about 0.2 seconds on my machine.
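If your own file still contains punctuation or mixed case, the words can be normalized before counting. Here is a minimal sketch, assuming a hypothetical helper named top_words and a deliberately simple regular expression for what counts as a word (neither comes from the original answer, so adjust them to your text):

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")

def top_words(lines, n=10):
    """Return the n most common lowercase words from an iterable of lines."""
    counts = Counter()
    for line in lines:
        # lower() merges "The" and "the"; findall() leaves punctuation out
        counts.update(WORD_RE.findall(line.lower()))
    return counts.most_common(n)

sample = ["The cat, the dog.", "The END!"]
print(top_words(sample, 1))  # prints [('the', 3)]
```

A file object is itself an iterable of lines, so calling top_words(input_file) inside the with block should work the same way.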

Your code was a bit haphazard - it looks like you've tried to bring together several approaches, keeping bits of each here and there. My code has been annotated with some explanatory comments. Hopefully it should be relatively straightforward, but if you're still confused about anything, let me know.
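As for why the snippet in the question prints nothing: as far as I can tell, the file is opened but never read, so the Counter is built from a dict that is still empty. A small sketch of that behaviour, reusing the variable names from the question:

```python
from collections import Counter

wordcount = {}           # nothing was ever added to this dict
d = Counter(wordcount)   # so the Counter is empty as well

# most_common() on an empty Counter returns an empty list,
# which means a for loop over it never runs its body.
print(d.most_common(10))  # prints []
```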




Answer 2:


You haven't pulled anything from the .txt file yet. What does the inside of the text file look like? If you want to classify words as groups of characters separated by spaces, you could get a list of the words with:

with open('path/to/file.txt', 'r') as f:
    # read the whole file, then split on any whitespace (spaces, tabs, newlines)
    words = f.read().split()

Then to get the 10 most common (there are probably more efficient ways, but this is what I found first):

word_counter = {}
for word in words:
    if word in word_counter:
        word_counter[word] += 1
    else:
        word_counter[word] = 1

popular_words = sorted(word_counter, key = word_counter.get, reverse = True)

print(popular_words[:10])
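Note that sorting the dict this way gives you only the words, not their counts. If you want both (for a word cloud you probably do), one possible variation is to sort the (word, count) pairs instead. This is just a sketch; the function names count_words and most_popular are made up for illustration:

```python
def count_words(words):
    """Count occurrences of each word using a plain dict."""
    word_counter = {}
    for word in words:
        # get() returns 0 the first time a word is seen
        word_counter[word] = word_counter.get(word, 0) + 1
    return word_counter

def most_popular(word_counter, n=10):
    """Return the n (word, count) pairs with the highest counts."""
    pairs = sorted(word_counter.items(), key=lambda pair: pair[1], reverse=True)
    return pairs[:n]

words = "a b a c a b".split()
print(most_popular(count_words(words), 2))  # prints [('a', 3), ('b', 2)]
```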


Source: https://stackoverflow.com/questions/45822827/count-most-commonly-used-words-in-a-txt-file
