Web scraping, screen scraping, data mining tips? [closed]

Submitted by 会有一股神秘感。 on 2019-12-02 21:22:00

I've found JSoup really good for HTML parsing.

For more pointers check this article out: How to write a multi-threaded webcrawler

I used Bixo to extract hyperlinks and images while doing a depth search. It is built on top of Hadoop and Cascading, so there is a learning curve, but the example provided is good enough to configure the changes ...

Try using the Web-Harvest project.

Check out JSR-237 for Work Management, which is a cool idea when going multithreaded.
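To give a feel for what "going multithreaded" can look like, here is a rough sketch using only `java.util.concurrent` from the standard library. It is not the JSR-237 API. The class `CrawlerSketch`, the method `crawl`, and the `fetchLinks` function are all made-up names for illustration; in a real crawler `fetchLinks` would download the page and pull out its links with something like JSoup or HtmlUnit.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class CrawlerSketch {

    // Breadth-first crawl limited to maxDepth levels. All pages of one level
    // are fetched in parallel on a fixed-size thread pool.
    // fetchLinks is a hypothetical hook: given a URL, return the links on that page.
    static Set<String> crawl(String seed, Function<String, List<String>> fetchLinks,
                             int maxDepth, int threads) {
        // Only the calling thread touches this set, so it needs no synchronization.
        Set<String> visited = new LinkedHashSet<>();
        visited.add(seed);
        List<String> frontier = List.of(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> fetchLinks.apply(url));
                }
                List<String> next = new ArrayList<>();
                // invokeAll blocks until every fetch of this level has finished.
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String link : result.get()) {
                        if (visited.add(link)) {
                            next.add(link); // first time seen: crawl it on the next level
                        }
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("crawl failed", e);
        } finally {
            pool.shutdown();
        }
        return visited;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a website, so the sketch runs without network access.
        Map<String, List<String>> site = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "d"),
                "d", List.of("e"));
        Set<String> seen = crawl("a", url -> site.getOrDefault(url, List.of()), 2, 4);
        System.out.println(seen);
    }
}
```

Fetching level by level keeps the de-duplication on a single thread, which is simpler than sharing a concurrent queue between workers, at the cost of waiting for the slowest page of each level. A production crawler would also want politeness delays, timeouts, and robots.txt handling, none of which are shown here.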

As for scraping, there are several alternatives. If ease of use is most important, I'd advise you to use HtmlUnit. Beyond that, you must roll your own.
