NLP: Building (small) corpora, or “Where to get lots of not-too-specialized English-language text files?”

温柔的废话 2021-01-13 03:41

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a

7 Answers
  •  温柔的废话
    2021-01-13 03:47

    • Use the Wikipedia dumps
      • needs lots of cleanup
    • See if anything in nltk-data helps you
      • the corpora are usually quite small
    • the WaCky people have some free corpora
      • tagged
      • you can spider your own corpus using their toolkit
    • Europarl is free and the basis of pretty much every academic MT system
      • spoken language, translated
    • The Reuters Corpora are free of charge, but only available on CD
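
To give a feel for the "lots of cleanup" that Wikipedia dumps need, here is a minimal sketch that strips the most common wiki markup from a single article's text. The function name `strip_wiki_markup` and the sample string are made up for illustration, and the regexes are an assumption about what is good enough for a small corpus: they do not handle nested templates, tables, or infoboxes, so a dedicated wikitext parser is likely the better choice for serious work.

```python
import re


def strip_wiki_markup(text: str) -> str:
    """Very rough MediaWiki cleanup (hypothetical helper, not a full parser)."""
    # Drop HTML comments and <ref> footnotes (self-closing and paired).
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"<ref[^>/]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Drop simple, non-nested templates such as {{citation needed}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove the quote runs used for bold and italics.
    text = re.sub(r"'{2,}", "", text)
    # Remove any remaining HTML-like tags.
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of spaces and tabs left behind by the removals.
    return re.sub(r"[ \t]+", " ", text).strip()


# Invented sample text, only for demonstration.
sample = (
    "'''Python''' is a [[programming language|language]] created by "
    "[[Guido van Rossum]].<ref>Some cite</ref> {{citation needed}}"
)
print(strip_wiki_markup(sample))
```

Running this on each article's text after extracting it from the dump XML would give roughly plain prose, though expect leftovers that need manual inspection.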

    You can always collect your own text, but be warned: HTML pages often need heavy cleanup, so it is easier to restrict yourself to RSS feeds.
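
As a rough illustration of why RSS is easier to work with than raw HTML pages, the sketch below pulls the title and description out of each item in an RSS 2.0 document using only the standard library. The function name `rss_to_texts` and the inline sample feed are invented for this example; downloading the feed is left out, and it assumes a plain RSS 2.0 feed without XML namespaces on the `item` elements.

```python
import html
import re
import xml.etree.ElementTree as ET


def rss_to_texts(rss_xml: str) -> list[str]:
    """Return one plain-text string per <item> (hypothetical helper)."""
    root = ET.fromstring(rss_xml)
    texts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        # Descriptions often carry embedded HTML: drop the tags,
        # then decode any entities that are still escaped.
        description = re.sub(r"<[^>]+>", " ", description)
        description = html.unescape(description)
        parts = [p for p in (title, description) if p.strip()]
        # Join title and body, normalizing all whitespace to single spaces.
        texts.append(" ".join(". ".join(parts).split()))
    return texts


# Invented sample feed, only for demonstration.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>First post</title>
<description>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</description></item>
<item><title>Second post</title>
<description>Plain text here</description></item>
</channel></rss>"""

for text in rss_to_texts(SAMPLE_RSS):
    print(text)
```

Many feeds only carry a summary rather than the full article, so check how much text each item actually contains before relying on a feed as a corpus source.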

    If you do this commercially, the LDC might be a viable alternative.
