How to set delimiters for PTB tokenizer?

Submitted by 落爺英雄遲暮 on 2019-12-25 07:39:41

Question


I'm using the Stanford CoreNLP library for my project. It uses the PTB tokenizer for tokenization. For a statement like this:

go to room no. #2145

or

go to room no. *2145

the tokenizer splits #2145 into two tokens: # and 2145. Is there any way to configure the tokenizer so that it doesn't treat # and * as delimiters?


Answer 1:


A quick solution is to use this option:

(command-line) -tokenize.whitespace
(in Java code) props.setProperty("tokenize.whitespace", "true");

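This makes the tokenizer split on whitespace only, so #2145 and *2145 each stay a single token. The trade-off is that punctuation attached to a word is no longer split off either. Do you need it to do anything other than tokenize on whitespace?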
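To make the effect concrete, here is a minimal plain-Java sketch of what whitespace-only splitting yields for the two sentences from the question. It does not call CoreNLP; the class name WhitespaceTokenSketch and the method tokenize are hypothetical names used only for illustration, and the real tokenizer may differ in details such as how it handles line breaks.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is NOT CoreNLP's tokenizer.
// Class and method names are hypothetical.
class WhitespaceTokenSketch {

    // Split text on runs of whitespace, dropping any empty pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "#2145" and "*2145" stay whole because they contain no whitespace.
        System.out.println(tokenize("go to room no. #2145"));
        System.out.println(tokenize("go to room no. *2145"));
    }
}
```

In this sketch "no." also keeps its trailing period, which shows the same trade-off: nothing is split unless whitespace separates it.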



Source: https://stackoverflow.com/questions/32688640/how-to-set-delimiters-for-ptb-tokenizer
