NLP: Building (small) corpora, or “Where to get lots of not-too-specialized English-language text files?”

温柔的废话 2021-01-13 03:41

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a

7 answers

温柔的废话 (original poster)

2021-01-13 03:47
- Use the Wikipedia dumps
  - needs lots of cleanup
- See if anything in nltk-data helps you
  - the corpora are usually quite small
- The WaCky project has some free corpora
  - tagged
  - you can spider your own corpus using their toolkit
- Europarl is free and the basis of pretty much every academic MT system
  - spoken language, translated
- The Reuters Corpora are free of charge, but only available on CD

You can always build your own corpus, but be warned: HTML pages often need heavy cleanup, so restrict yourself to RSS feeds.
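
As a rough illustration of the RSS route, here is a minimal sketch that uses only the Python standard library. It assumes a plain RSS 2.0 feed whose items carry `title` and `description` elements; the function name `rss_to_texts` and the sample feed are made up for this example, and real feeds (Atom, namespaced extensions, full-content fields) would need more handling.

```python
import html
import re
import xml.etree.ElementTree as ET

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag matcher, good enough for feed snippets
WS_RE = re.compile(r"\s+")


def clean(fragment):
    """Drop HTML tags, decode entities, and collapse runs of whitespace."""
    no_tags = TAG_RE.sub(" ", fragment)
    return WS_RE.sub(" ", html.unescape(no_tags)).strip()


def rss_to_texts(rss_xml):
    """Return one plain-text string per <item> of an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    texts = []
    for item in root.iter("item"):
        title = clean(item.findtext("title", default=""))
        body = clean(item.findtext("description", default=""))
        text = " ".join(part for part in (title, body) if part)
        if text:
            texts.append(text)
    return texts


# Hypothetical sample feed, only here to show the expected input shape.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>First post</title>
<description>&lt;p&gt;Hello &amp;amp; welcome.&lt;/p&gt;</description></item>
<item><title>Second post</title>
<description>Plain   text here.</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for text in rss_to_texts(SAMPLE_FEED):
        print(text)
```

For a live feed, the bytes returned by `urllib.request.urlopen(url).read()` can be passed straight to `rss_to_texts`. Writing each returned string to its own file gives a small plain-text corpus with far less cleanup than scraped HTML pages need.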

If you do this commercially, the LDC might be a viable alternative.