robots.txt parser for Java

筅森魡賤 posted on 2019-12-04 06:19:19

Heritrix is an open-source web crawler written in Java. Looking through its Javadoc, I see that it has a utility class, Robotstxt, for parsing robots.txt files.

There's also the jrobotx library, hosted on SourceForge.

(Full disclosure: I spun off the code that forms that library.)

anastluc

There is also a new release of crawler-commons:

https://github.com/crawler-commons/crawler-commons

The library aims to implement functionality common to any web crawler, and this includes a very handy robots.txt parser.
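For a rough idea of what such a parser does internally, here is a minimal sketch using only the JDK. It is not the API of Heritrix, jrobotx, or crawler-commons: the class name RobotsSketch and its methods parse and isAllowed are made up for illustration. It assumes plain path prefixes only (no wildcard or `$` matching), picks the group whose User-agent token appears in the crawler's name and falls back to the `*` group, and lets the longest matching prefix win, with Allow winning ties.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified robots.txt parser; not the API of any library named above. */
class RobotsSketch {

    /** One Allow or Disallow line: a path prefix and whether it permits access. */
    private static final class Rule {
        final String prefix;
        final boolean allow;
        Rule(String prefix, boolean allow) { this.prefix = prefix; this.allow = allow; }
    }

    private final List<Rule> rules;

    private RobotsSketch(List<Rule> rules) { this.rules = rules; }

    /** Collects the rules that apply to the given crawler name. */
    static RobotsSketch parse(String content, String agent) {
        String wanted = agent.toLowerCase(Locale.ROOT);
        List<Rule> specific = new ArrayList<>();  // rules from groups naming this agent
        List<Rule> wildcard = new ArrayList<>();  // rules from "User-agent: *" groups
        boolean inSpecific = false, inWildcard = false;
        boolean lastWasAgent = false, sawSpecific = false;

        for (String raw : content.split("\\R")) {
            int hash = raw.indexOf('#');           // strip trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;               // blank or malformed line
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                // A User-agent line after rule lines starts a new group.
                if (!lastWasAgent) { inSpecific = false; inWildcard = false; }
                String name = value.toLowerCase(Locale.ROOT);
                if (name.equals("*")) {
                    inWildcard = true;
                } else if (!name.isEmpty() && wanted.contains(name)) {
                    inSpecific = true;
                    sawSpecific = true;
                }
                lastWasAgent = true;
            } else if (field.equals("allow") || field.equals("disallow")) {
                lastWasAgent = false;
                if (value.isEmpty()) continue;     // an empty value restricts nothing
                Rule rule = new Rule(value, field.equals("allow"));
                if (inSpecific) specific.add(rule);
                if (inWildcard) wildcard.add(rule);
            }
            // Other fields (Sitemap, Crawl-delay, ...) are ignored in this sketch.
        }
        return new RobotsSketch(sawSpecific ? specific : wildcard);
    }

    /** Longest matching prefix decides; Allow wins ties; no match means allowed. */
    boolean isAllowed(String path) {
        Rule best = null;
        for (Rule r : rules) {
            if (!path.startsWith(r.prefix)) continue;
            if (best == null
                    || r.prefix.length() > best.prefix.length()
                    || (r.prefix.length() == best.prefix.length() && r.allow)) {
                best = r;
            }
        }
        return best == null || best.allow;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nAllow: /private/public/\n";
        RobotsSketch rules = RobotsSketch.parse(robots, "mybot");
        System.out.println(rules.isAllowed("/private/secret.html"));
        System.out.println(rules.isAllowed("/private/public/page.html"));
    }
}
```

A real crawler should prefer one of the libraries above, since they also deal with wildcards, crawl delays, sitemaps, and the many malformed robots.txt files found in practice.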
