nltk regular expression tokenizer

情深已故 2020-12-10 07:14

I tried to implement a regular expression tokenizer with nltk in Python, but the result is this:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40'), ('', '', '')]
1 Answer
  • 2020-12-10 07:30

    You should turn all the capturing groups into non-capturing ones:

    • ([A-Z]\.)+ -> (?:[A-Z]\.)+
    • \w+(-\w+)* -> \w+(?:-\w+)*
    • \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?

    The issue is that regexp_tokenize relies on re.findall, which returns a list of capture-group tuples when multiple capture groups are defined in the pattern, instead of a list of full matches. See the nltk.tokenize package reference:

    pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)
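
    To see why, here is a minimal sketch (plain re, independent of nltk; the sample sentence is the one from the question) of how capturing groups change what re.findall returns:

    import re

    text = 'That U.S.A. poster-print costs $12.40...'

    # With capturing groups, re.findall returns one tuple per match,
    # holding only the captured sub-parts (mostly empty strings here).
    capturing = r'(?x) ([A-Z]\.)+ | \w+(-\w+)* | \$?\d+(\.\d+)?%?'
    print(re.findall(capturing, text))
    # [('', '', ''), ('A.', '', ''), ('', '-print', ''), ...]

    # With non-capturing groups, re.findall returns the full matches.
    non_capturing = r'(?x) (?:[A-Z]\.)+ | \w+(?:-\w+)* | \$?\d+(?:\.\d+)?%?'
    print(re.findall(non_capturing, text))
    # ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40']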

    Also, you probably did not mean to use :-_ inside the character class: it defines a range from : to _ that happens to include all uppercase letters. Put the - at the end of the character class so it matches a literal hyphen.
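
    A quick check with plain re (nothing nltk-specific is assumed here) shows the unintended range in action:

    import re

    # [:-_] is a range from ':' (0x3A) to '_' (0x5F), so it matches 'A'-'Z' too.
    print(re.findall(r'[:-_]', 'A-Z'))  # ['A', 'Z'] -- the hyphen is NOT matched
    # With '-' moved to the end it becomes a literal hyphen, not a range operator.
    print(re.findall(r'[:_-]', 'A-Z'))  # ['-']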

    Thus, use

    pattern = r'''(?x)          # set flag to allow verbose regexps
            (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
          | \w+(?:-\w+)*        # words with optional internal hyphens
          | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
          | \.\.\.              # ellipsis
          | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
        '''
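
    With the corrected pattern, regexp_tokenize returns plain string tokens. A short usage sketch, assuming the pattern variable defined above and the sample sentence from the question:

    import nltk

    text = 'That U.S.A. poster-print costs $12.40...'
    print(nltk.regexp_tokenize(text, pattern))
    # ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']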
    