Getting OOM while using GATE on large data set

Posted by 霸气de小男生 on 2019-12-02 12:34:52

Processing large (or many) documents in GATE can require a lot of memory, because GATE needs space to store every annotation. On top of that, many processing resources are memory-hungry themselves: gazetteers, statistical model-based taggers, etc.

A trick in the GATE Developer GUI is to store the corpus of documents in a data store, then load only the corpus and run the pipeline. GATE is smart enough to load one document at a time, process it, then save and close it before opening the next one. (You can first store an empty corpus in a data store and then "populate" it from a folder. This also loads the documents one by one without wasting memory.)

This is exactly what you should do in your own code: open a document, process it, then save and close it before opening the next one. If you have a single large document, you should split it (in a way that doesn't hurt your annotation quality, for example at paragraph boundaries rather than mid-sentence).

Here is a code example from the "Advanced GATE Embedded" module:

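// for each piece of text:

// build a transient document from the text
Document doc = (Document)Factory.createResource("gate.corpora.DocumentImpl",
              Utils.featureMap("stringContent", text, "mimeType", mime));
Corpus corpus = Factory.newCorpus("webapp corpus");
try {
  corpus.add(doc);
  // assumes application is a CorpusController that has not been given this corpus yet
  application.setCorpus(corpus);
  application.execute();
  ...
} finally {
  // always release the document, even if processing failed
  corpus.clear();
  Factory.deleteResource(doc);
}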
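
For the single-large-document case, the splitting step mentioned above could look something like the sketch below. This is not part of GATE and not from the training module: TextSplitter, splitOnParagraphs and maxChars are hypothetical names, and it assumes that blank lines are safe places to cut your text (i.e. no annotation you care about spans a paragraph break). It uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a GATE class: cuts a large text into smaller
// pieces so that each piece can be processed as its own document.
final class TextSplitter {

    // Splits text at blank lines and packs consecutive paragraphs into chunks
    // of at most maxChars characters. A single paragraph longer than maxChars
    // is never cut; it becomes one oversized chunk.
    static List<String> splitOnParagraphs(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : text.split("\\n\\s*\\n")) {
            if (paragraph.isEmpty()) {
                continue; // produced by leading blank lines or empty input
            }
            // +2 accounts for the blank-line separator re-inserted below
            boolean overflow = current.length() + 2 + paragraph.length() > maxChars;
            if (current.length() > 0 && overflow) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append("\n\n");
            }
            current.append(paragraph);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Each returned chunk can then be fed through the create / execute / deleteResource loop shown above as the "text" of one document. Note that annotation offsets will be relative to the chunk, not to the original file, so you would need to keep track of where each chunk started if you want to map them back.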