crawler4j

Calling controller (crawler4j-3.5) inside a loop

Submitted by 泄露秘密 on 2019-12-01 12:11:10
Hi, I am calling the controller inside a for-loop because I have more than 100 URLs. I keep them all in a list, iterate over it, and crawl each page. I also pass each URL to setCustomData, so that the crawler does not leave that domain:

    for (Iterator<String> iterator = ifList.listIterator(); iterator.hasNext();) {
        String str = iterator.next();
        System.out.println("checking " + str);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.setCustomData(str);
        controller.addSeed(str);
        controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers);
        controller
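The question does not show BasicCrawler, but here is a minimal sketch of what it might look like against the crawler4j 3.5 API, assuming the seed stored via setCustomData is read back in shouldVisit to keep each crawl on its own domain (the class body is illustrative, not the asker's actual code):

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    // Illustrative crawler: only follows URLs that start with the seed URL
    // that was passed to CrawlController.setCustomData(...) for this crawl.
    public class BasicCrawler extends WebCrawler {

        @Override
        public boolean shouldVisit(WebURL url) {
            String seed = (String) getMyController().getCustomData();
            return url.getURL().toLowerCase().startsWith(seed.toLowerCase());
        }

        @Override
        public void visit(Page page) {
            System.out.println("Visited: " + page.getWebURL().getURL());
        }
    }

Because each iteration of the loop creates its own CrawlController with its own customData, the check in shouldVisit always compares against the seed of that particular crawl.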

Parsing robots.txt using Java and identifying whether a URL is allowed

Submitted by 不打扰是莪最后的温柔 on 2019-11-30 23:34:59
I am currently using jsoup in an application to parse and analyse web pages, but I want to make sure that I adhere to the robots.txt rules and only visit pages that are allowed. I am pretty sure that jsoup is not made for this; it is all about web scraping and parsing. So I plan to have a function/module that reads the robots.txt of the domain/site and identifies whether the URL I am going to visit is allowed or not. I did some research and found the following, but I am not sure about these, so it would be great if someone who has done the same kind of project involving robots.txt parsing could advise.
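One option, given that crawler4j is already the topic here: crawler4j ships a RobotstxtServer that downloads, parses and caches robots.txt per host and can answer whether a given URL is allowed. A minimal sketch against the crawler4j 3.5 API (the user-agent name and the URL are placeholder values):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class RobotsCheck {
        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            PageFetcher pageFetcher = new PageFetcher(config);

            // RobotstxtServer fetches robots.txt for the URL's host and caches it.
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            robotstxtConfig.setUserAgentName("MyBot");   // placeholder user-agent
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

            WebURL url = new WebURL();
            url.setURL("https://example.com/some/page"); // placeholder URL
            System.out.println("Allowed: " + robotstxtServer.allows(url));
        }
    }

If pulling in all of crawler4j is undesirable, a standalone robots.txt parser such as the one in the crawler-commons library is another option worth evaluating.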