I tried to implement a regular expression tokenizer with nltk in Python, but the result is this:
>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)      # set flag to allow verbose regexps
...     ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*          # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.              # ellipsis
...   | [][.,;"'?():-_`]    # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40'), ('', '', '')]
You should turn all the capturing groups into non-capturing ones:
([A-Z]\.)+          ->  (?:[A-Z]\.)+
\w+(-\w+)*          ->  \w+(?:-\w+)*
\$?\d+(\.\d+)?%?    ->  \$?\d+(?:\.\d+)?%?
The issue is that regexp_tokenize uses re.findall, which returns lists of capture-group tuples when the pattern defines multiple capturing groups. See the nltk.tokenize package reference:
pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)
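
To make the difference visible outside of nltk, here is a minimal sketch using plain re (the sample string and the simplified pattern are just for illustration):

>>> import re
>>> s = 'That U.S.A. poster-print costs $12.40...'
>>> re.findall(r'(?x) ([A-Z]\.)+ | \w+', s)     # capturing group: findall returns the group
['', 'A.', '', '', '', '', '']
>>> re.findall(r'(?x) (?:[A-Z]\.)+ | \w+', s)   # non-capturing: findall returns full matches
['That', 'U.S.A.', 'poster', 'print', 'costs', '12', '40']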
Also, I am not sure you wanted to use :-_ inside the character class: it is a range spanning ASCII 58-95, which includes all uppercase letters and several punctuation characters. To match a literal hyphen, put the - at the end of the character class.
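
You can verify the range behavior quickly (the sample string is just for illustration):

>>> import re
>>> re.findall(r'[:-_]', 'A-Z:_')   # :-_ spans ASCII 58-95, so it matches uppercase letters
['A', 'Z', ':', '_']
>>> re.findall(r'[:_-]', 'A-Z:_')   # with - last, only the three literal characters match
['-', ':', '_']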
Thus, use
pattern = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(?:-\w+)* # words with optional internal hyphens
| \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():_`-] # these are separate tokens; includes ], [
'''
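
With this pattern, regexp_tokenize should return whole tokens instead of group tuples:

>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']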