How to handle OUT OF MEMORY error for multiple threads in a Java Web Crawler


The simple answer (see above) is to increase the JVM heap size. That will help, but it is likely that the real problem is that your web crawling algorithm is building an in-memory data structure that grows in proportion to the number of pages you visit. If that is the case, the solution may be to move the data in that structure to disk, e.g. into a database.
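As a rough illustration of that idea, here is a minimal sketch of a disk-backed "visited URL" store using plain JDBC. The class name, table name, and JDBC URL are assumptions for the example, not part of your crawler; any embedded database with a JDBC driver on the classpath could back it.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical disk-backed store for visited URLs, so the set of seen
// pages does not grow without bound in the Java heap.
public class VisitedUrlStore implements AutoCloseable {
    private final Connection conn;

    // The JDBC URL is an assumption; pass whatever embedded database you use.
    public VisitedUrlStore(String jdbcUrl) throws SQLException {
        conn = DriverManager.getConnection(jdbcUrl);
        try (PreparedStatement st = conn.prepareStatement(
                "CREATE TABLE IF NOT EXISTS visited (url VARCHAR(2048) PRIMARY KEY)")) {
            st.executeUpdate();
        }
    }

    // Returns true if the URL was newly recorded, false if already visited.
    public synchronized boolean markVisited(String url) throws SQLException {
        try (PreparedStatement check = conn.prepareStatement(
                "SELECT 1 FROM visited WHERE url = ?")) {
            check.setString(1, url);
            try (ResultSet rs = check.executeQuery()) {
                if (rs.next()) {
                    return false; // already seen, skip this page
                }
            }
        }
        try (PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO visited (url) VALUES (?)")) {
            insert.setString(1, url);
            insert.executeUpdate();
            return true;
        }
    }

    @Override
    public void close() throws SQLException {
        conn.close();
    }
}
```

Crawler threads would then share one instance and only enqueue a URL when `markVisited(url)` returns true, keeping the heap usage roughly constant regardless of how many pages are crawled.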

The most appropriate solution to your problem depends on how your web crawler works, what it is collecting, and how many pages you need to crawl.

My first suggestion is that you increase the heap size for the JVM:

http://www.informix-zone.com/node/46
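For reference, the heap size is normally raised with the standard `-Xms` (initial) and `-Xmx` (maximum) JVM flags when launching the crawler; the class name and sizes below are placeholders:

```
java -Xms512m -Xmx2g MyCrawler
```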

Regarding the speed of your program:

If your web crawler obeys the robots.txt files on the servers it visits (which it should, to avoid being banned by the site admins), then there may be little that can be done.

You should profile your program, but I expect most of the time is spent downloading HTML pages, and site admins will usually not be happy if you download so fast that you drain their bandwidth.
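As an example of staying polite, here is a minimal sketch of a per-host throttle that multiple crawler threads could share. The one-second default delay is an assumption; in practice you would use the Crawl-delay from robots.txt when the site specifies one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical politeness throttle: requests to the same host are spaced
// at least delayMillis apart, even across multiple crawler threads.
public class HostThrottle {
    private final long delayMillis;
    private final Map<String, Long> nextAllowed = new ConcurrentHashMap<>();

    public HostThrottle(long delayMillis) {
        this.delayMillis = delayMillis; // e.g. 1000, or robots.txt Crawl-delay
    }

    public void waitForTurn(String host) throws InterruptedException {
        long sleep;
        synchronized (this) {
            long now = System.currentTimeMillis();
            long earliest = nextAllowed.getOrDefault(host, 0L);
            long start = Math.max(now, earliest);
            nextAllowed.put(host, start + delayMillis); // reserve the next slot
            sleep = start - now;
        }
        if (sleep > 0) {
            Thread.sleep(sleep); // back off before fetching the next page
        }
    }
}
```

Each worker thread would call `throttle.waitForTurn(url.getHost())` before downloading a page, so adding more threads does not translate into hammering any single site harder.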

In summary, downloading a whole site without hurting that site will take a while.
