crawler4j

Calling controller (crawler4j-3.5) inside a loop

Submitted by 泄露秘密 on 2019-12-01 12:11:10
Hi, I am calling the controller inside a for-loop because I have more than 100 URLs. I keep them all in a list, iterate over it, and crawl each page. I also pass each URL to setCustomData, so that the crawler does not leave that domain:

    for (Iterator<String> iterator = ifList.listIterator(); iterator.hasNext();) {
        String str = iterator.next();
        System.out.println("checking " + str);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.setCustomData(str);
        controller.addSeed(str);
        controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers);
        controller
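The question does not show BasicCrawler, but here is a minimal sketch of what it might look like against the crawler4j 3.5 API, assuming the seed stored via setCustomData is read back in shouldVisit to keep each crawl on its own domain (the class body is illustrative, not the asker's actual code):

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    // Illustrative crawler: only follows URLs that start with the seed URL
    // that was passed to CrawlController.setCustomData(...) for this crawl.
    public class BasicCrawler extends WebCrawler {

        @Override
        public boolean shouldVisit(WebURL url) {
            String seed = (String) getMyController().getCustomData();
            return url.getURL().toLowerCase().startsWith(seed.toLowerCase());
        }

        @Override
        public void visit(Page page) {
            System.out.println("Visited: " + page.getWebURL().getURL());
        }
    }

Because each iteration of the loop creates its own CrawlController with its own customData, the check in shouldVisit always compares against the seed of that particular crawl.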

Parsing robots.txt using Java and identifying whether a URL is allowed

Submitted by 不打扰是莪最后的温柔 on 2019-11-30 23:34:59
I am currently using jsoup in an application to parse and analyse web pages, but I want to make sure that I adhere to the robots.txt rules and only visit pages that are allowed. I am pretty sure that jsoup is not made for this; it is all about web scraping and parsing. So I plan to have a function/module that reads the robots.txt of the domain/site and identifies whether the URL I am going to visit is allowed or not. I did some research and found the following, but I am not sure about these, so it would be great if someone who has done the same kind of project involving robots.txt parsing could advise.
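One option, given that crawler4j is already the topic here: crawler4j ships a RobotstxtServer that downloads, parses and caches robots.txt per host and can answer whether a given URL is allowed. A minimal sketch against the crawler4j 3.5 API (the user-agent name and the URL are placeholder values):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class RobotsCheck {
        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            PageFetcher pageFetcher = new PageFetcher(config);

            // RobotstxtServer fetches robots.txt for the URL's host and caches it.
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            robotstxtConfig.setUserAgentName("MyBot");   // placeholder user-agent
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

            WebURL url = new WebURL();
            url.setURL("https://example.com/some/page"); // placeholder URL
            System.out.println("Allowed: " + robotstxtServer.allows(url));
        }
    }

If pulling in all of crawler4j is undesirable, a standalone robots.txt parser such as the one in the crawler-commons library is another option worth evaluating.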