Sorting FreqDist in NLTK with get vs get()

被刻印的时光 ゝ submitted on 2020-01-06 02:35:11

Question


I am playing around with NLTK and its FreqDist class:

import nltk
from nltk.corpus import gutenberg
print(gutenberg.fileids())
from nltk import FreqDist
fd = FreqDist()

for word in gutenberg.words('austen-persuasion.txt'):
    fd[word] += 1

newfd = sorted(fd, key=fd.get, reverse=True)[:10]

I have a question about the sorting step. When I run the code like this, it sorts the FreqDist keys properly. However, when I pass get() instead of get, I get this error:

Traceback (most recent call last):
  File "C:\Python34\NLP\NLP.py", line 21, in <module>
    newfd = sorted(fd, key=fd.get(), reverse=True)[:10]
TypeError: get expected at least 1 arguments, got 0

Why is get right and get() wrong? I was under the impression that get() should be correct, but apparently it is not.


Answer 1:


Essentially, the FreqDist object in NLTK is a subclass of collections.Counter from Python's standard library, so let's see how Counter works:

A Counter is a dictionary that stores the elements of a list (or any other iterable) as its keys and the counts of those elements as its values:

>>> from collections import Counter
>>> Counter(['a','a','b','c','c','c','d'])
Counter({'c': 3, 'a': 2, 'b': 1, 'd': 1})
>>> c = Counter(['a','a','b','c','c','c','d'])

To get the elements sorted by their frequency, you can use the .most_common() method. It returns a list of (element, count) tuples sorted from the highest count to the lowest:

>>> c.most_common()
[('c', 3), ('a', 2), ('b', 1), ('d', 1)]

And in reverse:

>>> list(reversed(c.most_common()))
[('d', 1), ('b', 1), ('a', 2), ('c', 3)]
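
If only the top few entries are needed, as with the [:10] slice in the question, most_common() also accepts an optional count. A minimal sketch using the same toy Counter (the name top_two is only for illustration):

```python
from collections import Counter

c = Counter(['a', 'a', 'b', 'c', 'c', 'c', 'd'])

# most_common(n) returns only the n highest-count (element, count) pairs.
top_two = c.most_common(2)
print(top_two)  # [('c', 3), ('a', 2)]
```

Since FreqDist is a Counter subclass, calling fd.most_common(10) on the question's fd should behave the same way, although that is not run here.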

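Like a dictionary, you can iterate through a Counter object and it will yield the keys. (The output shown here is from Python 2. In Python 3.7+ the keys come back in insertion order, and .keys() and .items() return view objects rather than lists.)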

>>> [key for key in c]
['a', 'c', 'b', 'd']
>>> c.keys()
['a', 'c', 'b', 'd']

You can also use the .items() method to get (key, value) tuples:

>>> c.items()
[('a', 2), ('c', 3), ('b', 1), ('d', 1)]
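
The (key, value) pairs can also be sorted directly by their count. This is a small sketch, not part of the original answer, that uses operator.itemgetter from the standard library to pick the second item of each pair as the sort value:

```python
from collections import Counter
from operator import itemgetter

c = Counter(['a', 'a', 'b', 'c', 'c', 'c', 'd'])

# itemgetter(1) builds a function that returns pair[1], i.e. the count.
pairs = sorted(c.items(), key=itemgetter(1), reverse=True)

# Only the first two are printed; the order of the tied 'b' and 'd'
# depends on the dictionary's iteration order.
print(pairs[:2])  # [('c', 3), ('a', 2)]
```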

Alternatively, if you only need the keys sorted by their counts, you can unzip the result of .most_common() (see "Transpose/Unzip Function (inverse of zip)?"):

>>> k, v = zip(*c.most_common())
>>> k
('c', 'a', 'b', 'd')

Going back to the question of .get vs .get(): the former is the method object itself, while the latter is a call to that method, and the call requires a dictionary key as its argument:

>>> c = Counter(['a','a','b','c','c','c','d'])
>>> c
Counter({'c': 3, 'a': 2, 'b': 1, 'd': 1})
>>> c.get
<built-in method get of Counter object at 0x7f5f95534868>
>>> c.get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: get expected at least 1 arguments, got 0
>>> c.get('a')
2
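
The same distinction can be seen in a short script. This sketch stores the method object under a made-up name, lookup, and calls it only later:

```python
from collections import Counter

c = Counter(['a', 'a', 'b', 'c', 'c', 'c', 'd'])

# c.get without parentheses is a bound method object; nothing is looked up yet.
lookup = c.get
print(callable(lookup))  # True

# Calling it later with a key performs the actual lookup.
print(lookup('c'))       # 3

# For a key that is not present, get() falls back to its default, None.
print(lookup('zzz'))     # None
```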

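When calling sorted(), the key=... parameter is not a key of the list/dictionary you're sorting. It is a function that sorted() calls on each element to compute the value to sort by. That is why key=c.get works (it passes the function), while key=c.get() fails (it tries to call the function with no argument before sorted() even runs).

The following two expressions are equivalent; both return only the values for the keys: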
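
To make the role of key=... concrete, here is a sketch with a hypothetical helper, count_of, that records every call. It suggests that sorted() calls the key function once per element, and that c.get, a lambda, and a named function are interchangeable here:

```python
from collections import Counter

c = Counter(['a', 'a', 'b', 'c', 'c', 'c', 'd'])

calls = []

def count_of(word):
    # Hypothetical helper: record the call, then return the word's count.
    calls.append(word)
    return c[word]

by_helper = sorted(c, key=count_of, reverse=True)
by_get = sorted(c, key=c.get, reverse=True)
by_lambda = sorted(c, key=lambda word: c[word], reverse=True)

print(by_helper == by_get == by_lambda)  # True
print(by_helper[:2])                     # ['c', 'a']
print(sorted(calls))                     # ['a', 'b', 'c', 'd'], one call per element
```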

>>> [c.get(key) for key in c]
[2, 3, 1, 1]
>>> [c[key] for key in c]
[2, 3, 1, 1]

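When sorting, those values are used as the sort criterion, so the following approaches give the same ordering, except possibly among elements with equal counts (compare 'b' and 'd' in the last two results):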

>>> sorted(c, key=c.get)
['b', 'd', 'a', 'c']
>>> v, k = zip(*sorted((c.get(key), key) for key in c))
>>> list(k)
['b', 'd', 'a', 'c']
>>> sorted(c, key=c.get, reverse=True) # Highest to lowest
['c', 'a', 'b', 'd']
>>> v, k = zip(*reversed(sorted((c.get(key), key) for key in c)))
>>> k
('c', 'a', 'd', 'b')
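
The last two results differ only in how the tied elements 'b' and 'd' are ordered. If a deterministic order for ties matters, one option (an addition here, not something the original answer proposes) is a tuple key that sorts by descending count and then alphabetically:

```python
from collections import Counter

c = Counter(['a', 'a', 'b', 'c', 'c', 'c', 'd'])

# Negating the count sorts high counts first; the word itself breaks ties.
ordered = sorted(c, key=lambda word: (-c[word], word))
print(ordered)  # ['c', 'a', 'b', 'd']
```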


Source: https://stackoverflow.com/questions/37427673/sorting-freqdist-in-nltk-with-get-vs-get
