SpaCy model training data: WikiNER

Submitted by 佐手 on 2019-12-06 06:06:13

The data server from Joel's (and my) former research group seems to be offline: http://downloads.schwa.org/wikiner

I found a mirror of the wp3 files here, which are the ones I'm using in spaCy: https://github.com/dice-group/FOX/tree/master/input/Wikiner

To retrain the spaCy model, you'll need to create a train/dev split (I'll get mine online for direct comparison, but for now, just take a random cut) and name the files with the .iob extension. Then use:

spacy convert -n 10 /path/to/file.iob /output/directory

The -n 10 argument is important for use in spaCy: it concatenates sentences into 'pseudo-paragraphs' of 10 sentences each. This lets the model learn that documents can come with multiple sentences.
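
For the "random cut" mentioned above, a minimal sketch in Python might look like the following. It assumes the wp3 file holds one sentence per line (blank lines are skipped) and that a 10% dev fraction is acceptable; the function names, the seed, and the fraction are all illustrative choices, not anything spaCy prescribes.

```python
import random
from pathlib import Path


def train_dev_split(sentences, dev_fraction=0.1, seed=0):
    """Shuffle the sentences and cut them into (train, dev) lists."""
    shuffled = list(sentences)
    # A seeded generator keeps the cut reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


def write_split(source_path, train_path, dev_path, dev_fraction=0.1, seed=0):
    """Read a WikiNER file and write train/dev files with the .iob extension."""
    # Assumption: one sentence per line; blank lines are dropped.
    with open(source_path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    train, dev = train_dev_split(sentences, dev_fraction, seed)
    Path(train_path).write_text("\n".join(train) + "\n", encoding="utf-8")
    Path(dev_path).write_text("\n".join(dev) + "\n", encoding="utf-8")
    return len(train), len(dev)
```

Calling write_split("/path/to/wikiner-file", "train.iob", "dev.iob") would produce the two files to pass to spacy convert. Note that shuffling at the sentence level discards the original sentence order, so the pseudo-paragraphs built by -n 10 will group unrelated sentences; splitting on contiguous blocks instead is a reasonable alternative if that matters for your comparison.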
