问题
I want to take every word from a text file, and count the word frequency in a dictionary.
Example: 'this is the textfile, and it is used to take words and count'
d = {'this': 1, 'is': 2, 'the': 1, ...}
I am not that far, but I just can't see how to complete it. My code so far:
import sys
argv = sys.argv[1]
data = open(argv)
words = data.read()
data.close()
wordfreq = {}
for i in words:
#there should be a counter and somehow it must fill the dict.
回答1:
If you don't want to use collections.Counter, you can write your own function:
import sys
filename = sys.argv[1]
fp = open(filename)
data = fp.read()
words = data.split()
fp.close()
unwanted_chars = ".,-_ (and so on)"
wordfreq = {}
for raw_word in words:
word = raw_word.strip(unwanted_chars)
if word not in wordfreq:
wordfreq[word] = 0
wordfreq[word] += 1
for finer things, look at regular expressions.
回答2:
Although, using Counter
from collections
library as suggested by @Michael is better approach, but I am adding answer just to improve your code(I believe that will be an answer for new Python learner):
From comment in your code it seem like you wants to improve your code. And I think you are able to read file content in words (while usually I avoid using read()
function and use for line in file_descriptor:
kind of code).
As words
is a string, In for loop, for i in words:
the loop-variable i
is not a word but a char. You are iterating over chars in string instead of iterating over words in string words
. To understand this notice following code snipe:
>>> for i in "Hi, h r u?":
... print i
...
H
i
,
h
r
u
?
>>>
Because iterating over string char by chars instead word by words is not what you wanted, to iterate words by words you should split method/function from string class in Python.
str.split(str="", num=string.count(str)) method returns a list of all the words in the string, using str as the separator (splits on all whitespace if left unspecified), optionally limiting the number of splits to num.
Notice below code examples:
Split:
>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?']
loop with split:
>>> for i in "Hi, how are you?".split():
... print i
...
Hi,
how
are
you?
And it looks something like you needs. Except word Hi,
because split()
by default split by whitespaces so Hi,
are kept as a single string (and obviously) that you don't want. To count frequency of words in the file.
One good solution can be that use regex, But first to keep answer simple I answering with replace()
method. The method str.replace(old, new[, max]) returns a copy of the string in which the occurrences of old have been replaced with new, optionally restricting the number of replacements to max.
Now check below code example for what I want to suggest:
>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?'] # it has , with Hi
>>> "Hi, how are you?".replace(',', ' ').split()
['Hi', 'how', 'are', 'you?'] # , replaced by space then split
loop:
>>> for word in "Hi, how are you?".replace(',', ' ').split():
... print word
...
Hi
how
are
you?
Now, how to count frequency:
One way is use Counter as @Michael suggested, but to use your approach in which you wants to start from empty dict. Do something like this code:
words = f.read()
wordfreq = {}
for word in .replace(', ',' ').split():
wordfreq[word] = wordfreq.setdefault(word, 0) + 1
# ^^ add 1 to 0 or old value from dict
What I am doing?: because initially wordfreq
is empty you can't assign to wordfreq[word]
at first time(it will rise key exception). so I used setdefault dict method.
dict.setdefault(key, default=None) is similar to get()
, but will set dict[key]=default
if key is not already in dict. So for first time when a new word comes I set it with 0
in dict using setdefault
then add 1
and assign to same dict.
I written an equivalent code using with open instead of single open
.
with open('~/Desktop/file') as f:
words = f.read()
wordfreq = {}
for word in words.replace(',', ' ').split():
wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print wordfreq
That runs like this:
$ cat file # file is
this is the textfile, and it is used to take words and count
$ python work.py # indented manually
{'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2,
'it': 1, 'to': 1, 'take': 1, 'words': 1,
'the': 1, 'textfile': 1}
Using re.split(pattern, string, maxsplit=0, flags=0)
Just change for loop: for i in re.split(r"[,\s]+", words):
, that should produce correct output.
Edit: better to find all alphanumeric character because you may have more than one punctuation symbols.
>>> re.findall(r'[\w]+', words) # manually indent output
['this', 'is', 'the', 'textfile', 'and',
'it', 'is', 'used', 'to', 'take', 'words', 'and', 'count']
use for loop as: for word in re.findall(r'[\w]+', words):
How would I write code without using read()
:
File is:
$ cat file
This is the text file, and it is used to take words and count. And multiple
Lines can be present in this file.
It is also possible that Same words repeated in with capital letters.
Code is:
$ cat work.py
import re
wordfreq = {}
with open('file') as f:
for line in f:
for word in re.findall(r'[\w]+', line.lower()):
wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print wordfreq
Used lower()
to convert upper letter to lower.
output:
$python work.py # manually strip output
{'and': 3, 'letters': 1, 'text': 1, 'is': 3,
'it': 2, 'file': 2, 'in': 2, 'also': 1, 'same': 1,
'to': 1, 'take': 1, 'capital': 1, 'be': 1, 'used': 1,
'multiple': 1, 'that': 1, 'possible': 1, 'repeated': 1,
'words': 2, 'with': 1, 'present': 1, 'count': 1, 'this': 2,
'lines': 1, 'can': 1, 'the': 1}
回答3:
from collections import Counter
t = 'this is the textfile, and it is used to take words and count'
dict(Counter(t.split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}
Or better with removing punctuation before counting:
dict(Counter(t.replace(',', '').replace('.', '').split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile': 1}
回答4:
The following takes the string, splits it into a list with split(), for loops the list and counts the frequency of each item in the sentence with Python's count function count (). The words,i, and its frequency are placed as tuples in an empty list, ls, and then converted into key and value pairs with dict().
sentence = 'this is the textfile, and it is used to take words and count'.split()
ls = []
for i in sentence:
word_count = sentence.count(i) # Pythons count function, count()
ls.append((i,word_count))
dict_ = dict(ls)
print dict_
output; {'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}
回答5:
#open your text book,Counting word frequency
File_obj=open("Counter.txt",'r')
w_list=File_obj.read()
print(w_list.split())
di=dict()
for word in w_list.split():
if word in di:
di[word]=di[word] + 1
else:
di[word]=1
max_count=max(di.values())
largest=-1
maxusedword=''
for k,v in di.items():
print(k,v)
if v>largest:
largest=v
maxusedword=k
print(maxusedword,largest)
回答6:
sentence = "this is the textfile, and it is used to take words and count"
# split the sentence into words.
# iterate thorugh every word
counter_dict = {}
for word in sentence.lower().split():
# add the word into the counter_dict initalize with 0
if word not in counter_dict:
counter_dict[word] = 0
# increase its count by 1
counter_dict[word] =+ 1
回答7:
My approach is to do few things from ground:
- Remove punctuations from the text input.
- Make list of words.
- Remove empty strings.
- Iterate through list.
- Make each new word a key into Dictionary with value 1.
- If a word is already exist as key then increment it's value by one.
text = '''this is the textfile, and it is used to take words and count'''
word = '' #This will hold each word
wordList = [] #This will be collection of words
for ch in text: #traversing through the text character by character
#if character is between a-z or A-Z or 0-9 then it's valid character and add to word string..
if (ch >= 'a' and ch <= 'z') or (ch >= 'A' and ch <= 'Z') or (ch >= '0' and ch <= '9'):
word += ch
elif ch == ' ': #if character is equal to single space means it's a separator
wordList.append(word) # append the word in list
word = '' #empty the word to collect the next word
wordList.append(word) #the last word to append in list as loop ended before adding it to list
print(wordList)
wordCountDict = {} #empty dictionary which will hold the word count
for word in wordList: #traverse through the word list
if wordCountDict.get(word.lower(), 0) == 0: #if word doesn't exist then make an entry into dic with value 1
wordCountDict[word.lower()] = 1
else: #if word exist then increament the value by one
wordCountDict[word.lower()] = wordCountDict[word.lower()] + 1
print(wordCountDict)
Another approach:
text = '''this is the textfile, and it is used to take words and count'''
for ch in '.\'!")(,;:?-\n':
text = text.replace(ch, ' ')
wordsArray = text.split(' ')
wordDict = {}
for word in wordsArray:
if len(word) == 0:
continue
else:
wordDict[word.lower()] = wordDict.get(word.lower(), 0) + 1
print(wordDict)
回答8:
you can also use default dictionaries with int type.
from collections import defaultdict
wordDict = defaultdict(int)
text = 'this is the textfile, and it is used to take words and count'.split(" ")
for word in text:
wordDict[word]+=1
explanation: we initialize a default dictionary whose values are of the type int. This way the default value for any key will be 0 and we don't need to check if a key is present in the dictionary or not. we then split the text with the spaces into a list of words. then we iterate through the list and increment the count of the word's count.
来源:https://stackoverflow.com/questions/21852066/counting-word-frequency-and-making-a-dictionary-from-it