The similar method from the nltk module produces different results on different machines. Why?

执念已碎 · 2021-01-07 23:49

I have taught a few introductory classes on text mining with Python, and the class tried the similar() method on the provided practice texts. Some students got different results from the same code, depending on the machine they ran it on. Why?

2 Answers
  •  盖世英雄少女心
    2021-01-08 00:15

    In short:

    It comes down to how Python 3 hashes dictionary keys when the similar() function uses a Counter dictionary: when counts tie, the output order follows the (randomized) key hashes. See http://pastebin.com/ysAF6p6h

    See also: How and why are dictionary hashes different in Python 2 and Python 3?
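
    You can observe the root cause without NLTK at all. Since Python 3.3, string hashes are salted with a per-process random value, whereas Python 2 hashes are stable:

    $ python3 -c "print(hash('foo'))"   # a different number in each new process
    $ python3 -c "print(hash('foo'))"   # (hash randomization, on by default since 3.3)
    $ python -c "print(hash('foo'))"    # Python 2: the same number every run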


    In detail:

    Let's start with:

    from nltk.book import *
    

    The import here comes from https://github.com/nltk/nltk/blob/develop/nltk/book.py, which imports the nltk.text.Text class and reads several corpora into Text objects.

    E.g. this is how the text1 variable is read in nltk.book:

    >>> import nltk.corpus
    >>> from nltk.text import Text
    >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
    

    Now, if we go down to the code for the similar() function at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L377, we see that self._word_context_index is initialized the first time it is accessed:

    def similar(self, word, num=20):
        """
        Distributional similarity: find other words which appear in the
        same contexts as the specified word; list most similar words first.
        :param word: The word used to seed the similarity search
        :type word: str
        :param num: The number of words to generate (default=20)
        :type num: int
        :seealso: ContextIndex.similar_words()
        """
        if '_word_context_index' not in self.__dict__:
            #print('Building word-context index...')
            self._word_context_index = ContextIndex(self.tokens, 
                                                    filter=lambda x:x.isalpha(), 
                                                    key=lambda s:s.lower())
    
    
        word = word.lower()
        wci = self._word_context_index._word_to_contexts
        if word in wci.conditions():
            contexts = set(wci[word])
            fd = Counter(w for w in wci.conditions() for c in wci[w]
                          if c in contexts and not w == word)
            words = [w for w, _ in fd.most_common(num)]
            print(tokenwrap(words))
        else:
            print("No matches")
    

    So that points us to the nltk.text.ContextIndex object, which is supposed to collect all the words sharing a context window and store them. Its docstring says:

    A bidirectional index between words and their 'contexts' in a text. The context of a word is usually defined to be the words that occur in a fixed window around the word; but other definitions may also be used by providing a custom context function.

    By default, if you call the similar() function, it initializes _word_context_index with the default context settings, i.e. a window of one token to the left and one to the right; see https://github.com/nltk/nltk/blob/develop/nltk/text.py#L40

    @staticmethod
    def _default_context(tokens, i):
        """One left token and one right token, normalized to lowercase"""
        left = (tokens[i-1].lower() if i != 0 else '*START*')
        right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*')
        return (left, right)
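
    For example, calling the static method directly on a toy token list (it is a private helper, accessed here only for illustration):

    from nltk.text import ContextIndex

    tokens = ['The', 'monstrous', 'whale', 'surfaced']
    print(ContextIndex._default_context(tokens, 1))   # ('the', 'whale')
    print(ContextIndex._default_context(tokens, 0))   # ('*START*', 'monstrous')
    print(ContextIndex._default_context(tokens, 3))   # ('whale', '*END*')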
    

    From the similar() function, we see that it iterates through the words-in-context stored in the word_context_index, i.e. wci = self._word_context_index._word_to_contexts.

    Essentially, _word_to_contexts is a dictionary whose keys are the words in the corpus and whose values are their (left, right) context words, from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L55:

        self._word_to_contexts = CFD((self._key(w), self._context_func(tokens, i))
                                     for i, w in enumerate(tokens))
    

    And here we see that it's a CFD, i.e. an nltk.probability.ConditionalFreqDist object, which does no smoothing of token probabilities; see the full code at https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L1646.
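
    You can poke at this structure yourself (assuming the Gutenberg corpus has been downloaded, e.g. via nltk.download('gutenberg'); _word_to_contexts is a private attribute, inspected here only to illustrate):

    from nltk.corpus import gutenberg
    from nltk.text import Text

    moby = Text(gutenberg.words('melville-moby_dick.txt'))
    moby.similar('monstrous')   # the first call builds _word_context_index

    wci = moby._word_context_index._word_to_contexts
    print(list(wci['monstrous'])[:5])   # some (left, right) contexts of 'monstrous'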


    The only possibility of getting a different result is when the similar() function loops through the most_common words at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L402.

    When two keys in a Counter object have the same count, ties come out in dictionary iteration order, which depends on the keys' hashes, and those hashes in turn depend on the CPU's bit-size; see http://www.laurentluce.com/posts/python-dictionary-implementation/


    The whole process of finding the similar words is itself deterministic, since:

    • the corpus/input is fixed: Text(gutenberg.words('melville-moby_dick.txt'))
    • the default context for every word is also fixed, i.e. self._word_context_index
    • the computation of the conditional frequency distribution for _word_context_index._word_to_contexts is deterministic

    The exception is when the function outputs the most_common list: when there's a tie in the Counter values, the keys come out in an order determined by their hashes. A deterministic alternative is sketched below.
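
    If reproducible output is wanted, the tie can be broken explicitly rather than left to dictionary order, e.g. by sorting on (-count, word). This is only a sketch of the idea, not what NLTK does:

    from collections import Counter

    fd = Counter({'foo': 1, 'bar': 1, 'foobar': 2})
    # Sort by descending count, then alphabetically, so ties are deterministic.
    words = [w for w, _ in sorted(fd.items(), key=lambda kv: (-kv[1], kv[0]))]
    print(words)   # ['foobar', 'bar', 'foo'] on every machine, every run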

    In Python 2, there's no reason to get different output from different runs on the same machine with the following code:

    $ python
    >>> from nltk.book import *
    >>> text1.similar('monstrous')
    >>> exit()
    $ python
    >>> from nltk.book import *
    >>> text1.similar('monstrous')
    >>> exit()
    $ python
    >>> from nltk.book import *
    >>> text1.similar('monstrous')
    >>> exit()
    

    But in Python 3, you get different output every time you run text1.similar('monstrous') in a new session; see http://pastebin.com/ysAF6p6h


    Here's a simple experiment demonstrating the quirky hashing difference between Python 2 and Python 3:

    alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    [('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
    alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    [('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
    alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    [('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
    
    
    alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    [('barfoo', 1), ('foobar', 1), ('bar', 1), ('foo', 1)]
    alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    [('foo', 1), ('barfoo', 1), ('bar', 1), ('foobar', 1)]
    alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    [('bar', 1), ('barfoo', 1), ('foobar', 1), ('foo', 1)]
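
    The randomization can be switched off by pinning the hash seed through the PYTHONHASHSEED environment variable, which makes the Python 3 ordering reproducible again (on a given machine):

    alvas@ubi:~$ PYTHONHASHSEED=0 python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    alvas@ubi:~$ PYTHONHASHSEED=0 python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"

    Both invocations print the same list.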
    
