crawler4j

Why is crawler4j hanging randomly?

Submitted by 天涯浪子 on 2019-12-12 01:58:12
Question: I've been using crawler4j for a few months now. I recently started noticing that it hangs on some sites and never returns. The recommended solution is to set resumable crawling to true, but that is not an option for me, as I am limited on disk space. I ran multiple tests and noticed that the hang was very random: it will crawl between 90 and 140 urls and then stop. I thought maybe it was the site, but there is nothing suspicious in the site's robots.txt, and all pages respond with 200 OK. I know the crawler hasn't
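
Without a thread dump it is hard to say what blocks, but one common mitigation is to tighten the fetcher timeouts so a stalled connection cannot hold a crawler thread indefinitely. A minimal sketch, assuming a recent crawler4j CrawlConfig; the storage path and timeout values are illustrative:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;

    public class TimeoutConfigDemo {
        public static void main(String[] args) {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/data/crawl/root"); // hypothetical path
            // Fail a fetch quickly instead of letting a worker thread wait forever.
            config.setConnectionTimeout(10_000); // ms to establish a connection
            config.setSocketTimeout(20_000);     // ms of socket inactivity
            config.setMaxPagesToFetch(1000);     // hard cap as an extra safety net
        }
    }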

How to schedule crawler4j crawl control to run periodically?

Submitted by 五迷三道 on 2019-12-12 01:48:26
Question: I'm using crawler4j to build a simple web crawler. What I want to do is to invoke the crawl control every 10 minutes. I created a servlet that starts when my Tomcat server starts, and in the servlet I am using a ScheduledExecutorService for the scheduling. However, the crawl control only fetches data ONCE (not every 10 minutes as I wanted). Is there a better way to schedule my crawl to execute every 10 minutes? Below is my code in the servlet. public class ScheduleControl extends HttpServlet {
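
One likely cause is that a CrawlController that has finished its crawl cannot simply be started again. A minimal sketch of the scheduling approach, building a fresh controller on every run (the storage path, seed URL, and the placeholder MyCrawler class are assumptions, not the asker's code):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class CrawlScheduler {
        // Placeholder crawler; a real one would override shouldVisit/visit.
        public static class MyCrawler extends WebCrawler { }

        public static void main(String[] args) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(CrawlScheduler::runCrawl, 0, 10, TimeUnit.MINUTES);
        }

        // Build a fresh controller each time; a finished controller is not reusable.
        private static void runCrawl() {
            try {
                CrawlConfig config = new CrawlConfig();
                config.setCrawlStorageFolder("/data/crawl/root"); // hypothetical path
                PageFetcher fetcher = new PageFetcher(config);
                RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
                CrawlController controller = new CrawlController(config, fetcher, robots);
                controller.addSeed("https://www.example.com/");   // hypothetical seed
                controller.start(MyCrawler.class, 7);             // blocks until the crawl finishes
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }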

Calling Controller.start in a loop in Crawler4j?

Submitted by 烂漫一生 on 2019-12-11 21:59:03
Question: I asked one question here, but this is a somewhat different question that sounds similar. Using crawler4j, I want to crawl multiple seed urls with a restriction on domain name (that is, a domain-name check in shouldVisit). Here is an example of how to do it. In short, you set a list of domain names using customData, pass it to the crawler class (from the controller), and in the shouldVisit function loop through this data (which is a list; see the linked url) to see if the domain name is in the list; if so, return
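
A minimal sketch of that shouldVisit check, assuming the crawler4j 4.x callback signature and the customData mechanism described above (customData has since been deprecated in newer releases):

    import java.util.List;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class DomainRestrictedCrawler extends WebCrawler {
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // The domain list is handed over by the controller via setCustomData,
            // matching the approach described above; the cast is unchecked.
            @SuppressWarnings("unchecked")
            List<String> allowedDomains = (List<String>) getMyController().getCustomData();
            String host = url.getDomain();
            return allowedDomains.stream().anyMatch(host::endsWith);
        }
    }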

How to add (integrate) crawljax with crawler4j?

Submitted by a 夏天 on 2019-12-11 16:38:59
Question: I am working on a web crawler which fetches data from a website using crawler4j, and everything goes well, but the main problem is with ajax-based events. I found that the crawljax library addresses this, but I couldn't figure out where and when to use it. In what sequence should I use it? Before fetching the page using crawler4j? After fetching the page using crawler4j? Or should I take a url produced by crawler4j and use it to fetch the Ajax data (page) using crawljax? Answer 1: The library crawljax is basically a
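
The third option is the usual division of labour: crawler4j discovers URLs statically, and crawljax loads each one in a real browser to fire the Ajax events. A rough sketch, assuming crawljax 3.x's builder API; running a full browser for every visited page is expensive, so this is illustrative only:

    import com.crawljax.core.CrawljaxRunner;
    import com.crawljax.core.configuration.CrawljaxConfiguration;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;

    public class HybridCrawler extends WebCrawler {
        @Override
        public void visit(Page page) {
            String url = page.getWebURL().getURL();
            try {
                // Hand the statically-fetched URL to crawljax, which drives a
                // real browser and can trigger the Ajax events crawler4j misses.
                CrawljaxConfiguration config =
                        CrawljaxConfiguration.builderFor(url).build();
                new CrawljaxRunner(config).call();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }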

Syntax error, insert “… VariableDeclaratorId” to complete FormalParameterList

Submitted by 折月煮酒 on 2019-12-10 10:06:31
Question: I am facing some issues with this code: import edu.uci.ics.crawler4j.crawler.CrawlConfig; import edu.uci.ics.crawler4j.crawler.CrawlController; import edu.uci.ics.crawler4j.fetcher.PageFetcher; import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig; import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer; public class Controller { String crawlStorageFolder = "/data/crawl/root"; int numberOfCrawlers = 7; CrawlConfig config = new CrawlConfig(); config.setCrawlStorageFolder(crawlStorageFolder); /
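
The error comes from the statement config.setCrawlStorageFolder(crawlStorageFolder); sitting directly in the class body, where Java only allows declarations, so the compiler tries to parse it as a malformed field. Moving the statements into a method fixes it; a sketch of the corrected class (the addSeed/start calls from the standard crawler4j quickstart are elided):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class Controller {
        public static void main(String[] args) throws Exception {
            String crawlStorageFolder = "/data/crawl/root";
            int numberOfCrawlers = 7;

            // Statements are legal here, inside a method body.
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder(crawlStorageFolder);

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            // controller.addSeed(...) and controller.start(...) would follow here.
        }
    }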

Crawler4j vs. Jsoup for the pages crawling and parsing in Java

Submitted by 旧时模样 on 2019-12-09 09:35:45
Question: I want to get the content of a page and extract specific parts of it. As far as I know, there are at least two solutions for such a task: Crawler4j and Jsoup. Both of them are capable of retrieving the content of a page and extracting sub-parts of it. The only thing I'm not sure about is what the difference between them is. There is a similar question, which is marked as answered: Crawler4j is a crawler, Jsoup is a parser. But I just checked, and Jsoup is also capable of crawling a page in addition to a
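
What "Jsoup can crawl" amounts to is a one-off fetch-and-parse of a single URL, without the multi-threading, politeness delays, or robots.txt handling that a crawler like Crawler4j provides. A minimal sketch (the URL is a placeholder):

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    public class JsoupFetchDemo {
        public static void main(String[] args) throws IOException {
            // One HTTP fetch plus parsing; no link queue, no crawl policy.
            Document doc = Jsoup.connect("https://example.com/").get();
            Elements links = doc.select("a[href]");
            System.out.println("Found " + links.size() + " links");
        }
    }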

Crawler4j with authentication

Submitted by 送分小仙女 on 2019-12-05 06:33:56
Question: I'm trying to run crawler4j against a personal Redmine instance for testing purposes. I want to authenticate and crawl several levels of depth in the application. I followed this tutorial from the crawler4j FAQ and created the following snippet: import edu.uci.ics.crawler4j.crawler.Page; import edu.uci.ics.crawler4j.crawler.WebCrawler; import edu.uci.ics.crawler4j.parser.HtmlParseData; import edu.uci.ics.crawler4j.url.WebURL; public class CustomWebCrawler extends WebCrawler{ @Override public void visit(final Page pPage) { if (pPage.getParseData() instanceof HtmlParseData) { System.out.println("URL: " +
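
For the authentication part, a minimal sketch of crawler4j's form-based login configuration, assuming the 4.x authentication API; the login URL and form-field names are hypothetical and must match the Redmine login form's actual HTML:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
    import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

    public class AuthSetup {
        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            // Hypothetical credentials, URL, and form-field names; verify the
            // field names against the login page's <input> elements.
            AuthInfo authInfo = new FormAuthInfo("myUser", "myPass",
                    "http://localhost:3000/login", "username", "password");
            config.addAuthInfo(authInfo);
            // ... build the PageFetcher and CrawlController from this config as usual.
        }
    }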

Crawler in Groovy (JSoup VS Crawler4j)

Submitted by 懵懂的女人 on 2019-12-04 22:33:17
Question: I wish to develop a web crawler in Groovy (using the Grails framework and a MongoDB database) that has the ability to crawl a website, creating a list of site URLs and their resource types, their content, the response times, and the number of redirects involved. I am debating between JSoup and Crawler4j. I have read about what they basically do, but I cannot clearly understand the difference between the two. Can anyone suggest which would be better for the above functionality? Or is it totally incorrect to compare the two? Thanks. Answer 1: Crawler4J is a
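
For the site-audit requirements listed above, Crawler4j already exposes most of the per-page metadata. A minimal sketch, in Java (which Groovy code can call directly); response time is an assumption you would have to measure yourself, as crawler4j does not report it:

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;

    public class AuditingCrawler extends WebCrawler {
        @Override
        public void visit(Page page) {
            // URL, resource type, and HTTP status come straight from the Page.
            String url = page.getWebURL().getURL();
            String contentType = page.getContentType(); // resource type
            int status = page.getStatusCode();
            System.out.println(status + " " + contentType + " " + url);
        }
    }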
