I am trying to build an inverted document index, so I need to know, for every unique word in a collection, which documents it occurs in and how often.
i have used this an
I agree you should avoid the extra classes, and especially `__getitem__`. (Small conceptual errors can make `__getitem__` or `__getattr__` quite painful to debug.)
Python's `dict` seems quite strong enough for what you are doing. What about a straightforward `dict.setdefault`?

    for keyword in uniques:  # for every unique word
        for word in text:  # for every word in the doc
            if word == keyword:
                dictionary.setdefault(keyword, {})
                dictionary[keyword].setdefault(filename, 0)
                dictionary[keyword][filename] += 1

Of course this would be where `dictionary` is just a `dict`, and not something from `collections` or a custom class of your own.
Then again, isn't this just:

    for word in text:  # for every word in the doc
        dictionary.setdefault(word, {})
        dictionary[word].setdefault(filename, 0)
        dictionary[word][filename] += 1

No reason to isolate unique instances, since the dict forces unique keys anyway.
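
For completeness, here is a minimal, self-contained sketch of that simpler loop wrapped over a whole collection. The `build_index` function, the `docs` mapping, and the sample filenames are my own assumptions for illustration; I am also assuming each document has already been split into a list of words.

```python
def build_index(docs):
    """Map each word to a {filename: count} dict, using plain dicts only."""
    index = {}
    for filename, text in docs.items():  # for every doc in the collection
        for word in text:  # for every word in the doc
            index.setdefault(word, {})
            index[word].setdefault(filename, 0)
            index[word][filename] += 1
    return index


# Hypothetical sample collection: filename -> list of already-tokenized words.
docs = {
    "a.txt": ["the", "cat", "sat", "the"],
    "b.txt": ["the", "dog"],
}
index = build_index(docs)
print(index["the"])  # {'a.txt': 2, 'b.txt': 1}
print(index["cat"])  # {'a.txt': 1}
```

Since `setdefault` only creates an entry when the key is missing, looking up a word that never occurred (for example with `index.get("bird", {})`) does not silently add it to the index.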