nltk stemmer: string index out of range

回眸只為那壹抹淺笑 posted on 2019-12-03 01:58:06

This is an NLTK bug specific to NLTK version 3.2.2, for which I am to blame. It was introduced by PR https://github.com/nltk/nltk/pull/1261 which rewrote the Porter stemmer.

I wrote a fix which went out in NLTK 3.2.3. If you're on version 3.2.2 and want the fix, just upgrade, e.g. by running:

pip install -U nltk

I debugged the nltk.stem.porter module using pdb. After a few iterations, in _apply_rule_list() you get:

>>> rule
(u'at', u'ate', None)
>>> word
u'o'

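At this point the _ends_double_consonant() method tries to evaluate word[-1] == word[-2]. Since word is the single character u'o', word[-2] is out of range and the comparison raises an IndexError.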
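The failure can be reproduced without NLTK at all. The following is a minimal sketch, not the library's code: the function name ends_double_consonant_unguarded is made up for illustration, and it mirrors only the comparison described above, leaving out the consonant check.

```python
def ends_double_consonant_unguarded(word):
    # Hypothetical stand-in that mirrors only the comparison described above.
    # The real NLTK method additionally checks that the last letter is a consonant.
    return word[-1] == word[-2]


try:
    ends_double_consonant_unguarded("o")  # one character, so word[-2] does not exist
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)  # string index out of range
```

Any one-character string triggers the same error, which is why the stemmer only fails once earlier rules have reduced a word to a single letter.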

If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:

def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):
        return False
    return self._cons(word, len(word)-1)

As far as I can see, the len(word) < 2 check is missing in the new version.

Changing _ends_double_consonant() to something like this should work:

def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )

I just proposed this change in the related NLTK issue.
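To see the effect of the length guard in isolation, here is a standalone sketch of the same check written as a plain function. The helper is_consonant is a simplified assumption (it treats every letter outside "aeiou" as a consonant and ignores the Porter algorithm's special handling of 'y'), so it is not equivalent to NLTK's _is_consonant().

```python
VOWELS = frozenset("aeiou")


def is_consonant(word, i):
    # Simplified assumption: ignores the Porter algorithm's special rule for 'y'.
    return word[i] not in VOWELS


def ends_double_consonant(word):
    """Return True if word ends with a doubled consonant (condition *d)."""
    # Guard first: words shorter than two characters cannot end in a double letter.
    if len(word) < 2:
        return False
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


for sample in ["o", "", "fall", "see", "hop"]:
    print(repr(sample), ends_double_consonant(sample))
```

With the guard in place, the one-character word 'o' from the pdb session simply returns False instead of raising.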
