The similar method from the nltk module produces different results on different machines. Why?

执念已碎 · 2021-01-07 23:49

I have taught a few introductory classes on text mining with Python, and the class tried the similar() method on the provided practice texts. Some students got different results from the same code, depending on the machine they ran it on. Why?

2 Answers
  •  盖世英雄少女心
    2021-01-08 00:15

    In short:

    It comes down to how Python 3 hashes dictionary keys when the similar() function uses a Counter dictionary: when counts tie, the output order follows the (randomized) key hashes. See http://pastebin.com/ysAF6p6h

    See also: How and why are dictionary hashes different in Python 2 and Python 3?
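
    You can observe the root cause without NLTK at all. Since Python 3.3, string hashes are salted with a per-process random value, whereas Python 2 hashes are stable:

    $ python3 -c "print(hash('foo'))"   # a different number in each new process
    $ python3 -c "print(hash('foo'))"   # (hash randomization, on by default since 3.3)
    $ python -c "print(hash('foo'))"    # Python 2: the same number every run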


    In detail:

    Let's start with:

    from nltk.book import *
    

    The import here comes from https://github.com/nltk/nltk/blob/develop/nltk/book.py, which imports the nltk.text.Text class and reads several corpora into Text objects.

    E.g. this is how the text1 variable is read in nltk.book:

    >>> import nltk.corpus
    >>> from nltk.text import Text
    >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
    

    Now, if we go down to the code for the similar() function at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L377, we see that self._word_context_index is initialized the first time it is accessed:

    def similar(self, word, num=20):
        """
        Distributional similarity: find other words which appear in the
        same contexts as the specified word; list most similar words first.
        :param word: The word used to seed the similarity search
        :type word: str
        :param num: The number of words to generate (default=20)
        :type num: int
        :seealso: ContextIndex.similar_words()
        """
        if '_word_context_index' not in self.__dict__:
            #print('Building word-context index...')
            self._word_context_index = ContextIndex(self.tokens, 
                                                    filter=lambda x:x.isalpha(), 
                                                    key=lambda s:s.lower())
    
    
        word = word.lower()
        wci = self._word_context_index._word_to_contexts
        if word in wci.conditions():
            contexts = set(wci[word])
            fd = Counter(w for w in wci.conditions() for c in wci[w]
                          if c in contexts and not w == word)
            words = [w for w, _ in fd.most_common(num)]
            print(tokenwrap(words))
        else:
            print("No matches")
    

    So that points us to the nltk.text.ContextIndex object, which is supposed to collect all the words sharing a context window and store them. Its docstring says:

    A bidirectional index between words and their 'contexts' in a text. The context of a word is usually defined to be the words that occur in a fixed window around the word; but other definitions may also be used by providing a custom context function.

    By default, if you call the similar() function, it initializes _word_context_index with the default context settings, i.e. a window of one token to the left and one to the right; see https://github.com/nltk/nltk/blob/develop/nltk/text.py#L40

    @staticmethod
    def _default_context(tokens, i):
        """One left token and one right token, normalized to lowercase"""
        left = (tokens[i-1].lower() if i != 0 else '*START*')
        right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*')
        return (left, right)
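
    For example, calling the static method directly on a toy token list (it is a private helper, accessed here only for illustration):

    from nltk.text import ContextIndex

    tokens = ['The', 'monstrous', 'whale', 'surfaced']
    print(ContextIndex._default_context(tokens, 1))   # ('the', 'whale')
    print(ContextIndex._default_context(tokens, 0))   # ('*START*', 'monstrous')
    print(ContextIndex._default_context(tokens, 3))   # ('whale', '*END*')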
    

    From the similar() function, we see that it iterates through the words-in-context stored in the word_context_index, i.e. wci = self._word_context_index._word_to_contexts.

    Essentially, _word_to_contexts is a dictionary whose keys are the words in the corpus and whose values are their (left, right) context words, from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L55:

        self._word_to_contexts = CFD((self._key(w), self._context_func(tokens, i))
                                     for i, w in enumerate(tokens))
    

    And here we see that it's a CFD, i.e. an nltk.probability.ConditionalFreqDist object, which does no smoothing of token probabilities; see the full code at https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L1646.
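
    You can poke at this structure yourself (assuming the Gutenberg corpus has been downloaded, e.g. via nltk.download('gutenberg'); _word_to_contexts is a private attribute, inspected here only to illustrate):

    from nltk.corpus import gutenberg
    from nltk.text import Text

    moby = Text(gutenberg.words('melville-moby_dick.txt'))
    moby.similar('monstrous')   # the first call builds _word_context_index

    wci = moby._word_context_index._word_to_contexts
    print(list(wci['monstrous'])[:5])   # some (left, right) contexts of 'monstrous'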


    The only possibility of getting a different result is when the similar() function loops through the most_common words at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L402.

    When two keys in a Counter object have the same count, ties come out in dictionary iteration order, which depends on the keys' hashes, and those hashes in turn depend on the CPU's bit-size; see http://www.laurentluce.com/posts/python-dictionary-implementation/


    The whole process of finding the similar words is itself deterministic, since:

    • the corpus/input is fixed: Text(gutenberg.words('melville-moby_dick.txt'))
    • the default context for every word is also fixed, i.e. self._word_context_index
    • the computation of the conditional frequency distribution for _word_context_index._word_to_contexts is deterministic

    The exception is when the function outputs the most_common list: when there's a tie in the Counter values, the keys come out in an order determined by their hashes. A deterministic alternative is sketched below.
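
    If reproducible output is wanted, the tie can be broken explicitly rather than left to dictionary order, e.g. by sorting on (-count, word). This is only a sketch of the idea, not what NLTK does:

    from collections import Counter

    fd = Counter({'foo': 1, 'bar': 1, 'foobar': 2})
    # Sort by descending count, then alphabetically, so ties are deterministic.
    words = [w for w, _ in sorted(fd.items(), key=lambda kv: (-kv[1], kv[0]))]
    print(words)   # ['foobar', 'bar', 'foo'] on every machine, every run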

    In Python 2, there's no reason to get different output from different runs on the same machine with the following code:

    $ python
    >>> from nltk.book import *
    >>> text1.similar('monstrous')
    >>> exit()
    $ python
    >>> from nltk.book import *
    >>> text1.similar('monstrous')
    >>> exit()
    $ python
    >>> from nltk.book import *
    >>> text1.similar('monstrous')
    >>> exit()
    

    But in Python 3, you get different output every time you run text1.similar('monstrous') in a new session; see http://pastebin.com/ysAF6p6h


    Here's a simple experiment demonstrating the quirky hashing difference between Python 2 and Python 3:

    alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    [('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
    alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    [('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
    alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    [('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
    
    
    alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    [('barfoo', 1), ('foobar', 1), ('bar', 1), ('foo', 1)]
    alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    [('foo', 1), ('barfoo', 1), ('bar', 1), ('foobar', 1)]
    alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    [('bar', 1), ('barfoo', 1), ('foobar', 1), ('foo', 1)]
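
    The randomization can be switched off by pinning the hash seed through the PYTHONHASHSEED environment variable, which makes the Python 3 ordering reproducible again (on a given machine):

    alvas@ubi:~$ PYTHONHASHSEED=0 python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
    alvas@ubi:~$ PYTHONHASHSEED=0 python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"

    Both invocations print the same list.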
    
