crawler4j

Don't know Python? Then crawl another way! A summary of Java web-crawling techniques

泪湿孤枕 submitted on 2020-04-27 18:17:02
—This blog post is original content; please credit the author when reposting— A few days ago a junior schoolmate of mine, who is about to graduate, needed data for her thesis research. A search on CNKI turned up well over a hundred thousand records! Her advisor told her to collect them by copying them manually, but copying that many records by hand would waste an enormous amount of time, so she asked me for help. I thought about it and figured that writing a crawler to pull the data down might be a good solution. I had always heard people say that Python is the best fit for crawlers, but I am a Java engineer! As Lu Xun (never) said: learning Python cannot save China, but Java can!

OK, just kidding. The main issue was that she needed the data urgently, and learning a whole new language just to write one crawler was not realistic, so I went with Java. A quick look on Zhihu showed that Java actually has plenty of open-source crawler APIs, so I got to work. Three days later the program was done and pulling data down. Below is a summary of the techniques I used; anyone interested is welcome to discuss.

Before sharing the techniques, a quick word on how a crawler works. "Web crawler" sounds impressive, but the principle is simple: the program sends a request to a given URL, the server returns the complete HTML, and the program then parses that HTML. Parsing boils down to locating HTML elements and extracting the data you want.

As for open-source Java crawler APIs, there are quite a few; see this link for a list: Recommended open-source Java crawler projects. Since I was not going to use this in a real production project, I chose crawler4j, which is very lightweight and easy to pick up. You can read its introduction on GitHub; here is a brief look at how to use it.
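To make that request-parse-extract flow concrete, here is a minimal crawler4j sketch (not the original post's code; it assumes the crawler4j 4.x API, where shouldVisit also receives the referring page, and the seed URL and storage folder are placeholders):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class BasicCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Only follow links that stay on the site we started from (placeholder domain).
        return url.getURL().startsWith("https://example.com/");
    }

    @Override
    public void visit(Page page) {
        // crawler4j has already downloaded and parsed the page; pull out what we need.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlData = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL() + " -> " + htmlData.getTitle());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-data"); // intermediate crawl state
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("https://example.com/");
        controller.start(BasicCrawler.class, 5); // 5 crawler threads
    }
}
```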

Web Crawling (Ajax/JavaScript enabled pages) using java

纵然是瞬间 submitted on 2020-01-09 09:37:08
Question: I am very new to web crawling. I am using crawler4j to crawl websites and collect the required information from them. My problem is that I was unable to crawl the content of the following site: http://www.sciencedirect.com/science/article/pii/S1568494612005741 . I want to extract the information shown in the attached screenshot, which contains three names (highlighted in red boxes).
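The excerpt stops before any answer. For context, crawler4j fetches the raw HTML and does not execute JavaScript, so content injected by AJAX never appears in the page it parses. One common workaround is to render such pages with a headless browser like HtmlUnit and then hand the resulting DOM to whatever parser you use; a minimal sketch assuming the HtmlUnit 2.x API, with an arbitrary wait time:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class AjaxPageFetcher {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Let HtmlUnit execute the page's JavaScript and tolerate script errors.
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage(
                    "http://www.sciencedirect.com/science/article/pii/S1568494612005741");
            // Give background AJAX calls a few seconds to finish (arbitrary timeout).
            webClient.waitForBackgroundJavaScript(5_000);
            // The DOM now includes script-injected content; pass it to any HTML parser.
            System.out.println(page.asXml());
        }
    }
}
```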

Restricting URLs to seed URL domain only crawler4j

倖福魔咒の submitted on 2020-01-03 02:55:52
Question: I want crawler4j to visit only pages that belong to the seed domains. There are multiple domains in the seed list. How can I do this? Suppose I add the seed URLs: www.google.com www.yahoo.com www.wikipedia.com Now I start the crawl, but I want my crawler to visit pages (via shouldVisit()) only within those three domains. There will obviously be external links, but the crawler should be restricted to these domains; sub-domains and sub-folders are fine, just nothing outside these domains.
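One way to do this, sketched below, is to keep the seed domains in a set and reject everything else in shouldVisit(). This assumes crawler4j 4.x (where shouldVisit receives the referring page) and relies on WebURL.getDomain() returning the registered domain without the sub-domain; the class name and domain list are placeholders:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class SeedDomainCrawler extends WebCrawler {

    // The registered domains of the seeds; sub-domains of these are still allowed.
    private static final Set<String> SEED_DOMAINS =
            new HashSet<>(Arrays.asList("google.com", "yahoo.com", "wikipedia.com"));

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // getDomain() strips the sub-domain, so sub-domains and sub-folders pass,
        // while links to any other domain are rejected.
        return SEED_DOMAINS.contains(url.getDomain().toLowerCase());
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }
}
```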

Parsing robots.txt using Java and identifying whether a URL is allowed

佐手、 submitted on 2019-12-30 07:05:27
Question: I am currently using jsoup in an application to parse and analyse web pages, but I want to make sure I adhere to the robots.txt rules and only visit pages that are allowed. I am fairly sure jsoup is not made for this; it is all about web scraping and parsing. So I planned to write a function/module that reads the robots.txt of the domain/site and determines whether the URL I am about to visit is allowed or not. I did some research and found the following, but I am not sure
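One lightweight option is a hand-rolled checker that only understands the `User-agent: *` group and plain `Disallow:` prefix rules (no Allow, no wildcards, no crawl-delay); crawler4j's built-in robots.txt handling or the crawler-commons library are more complete alternatives. The class below is a hypothetical helper written for illustration:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Minimal robots.txt checker: loads the file for one host and answers
// prefix-based Disallow rules from the "User-agent: *" group only.
public class SimpleRobotsTxt {

    private final List<String> disallowedPrefixes = new ArrayList<>();

    public SimpleRobotsTxt(String hostBase) throws Exception {
        URL robotsUrl = new URL(hostBase + "/robots.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(robotsUrl.openStream(), StandardCharsets.UTF_8))) {
            boolean inWildcardGroup = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    // Track whether we are inside the catch-all agent group.
                    inWildcardGroup = line.substring(11).trim().equals("*");
                } else if (inWildcardGroup && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) {
                        disallowedPrefixes.add(path);
                    }
                }
            }
        }
    }

    public boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        SimpleRobotsTxt robots = new SimpleRobotsTxt("https://example.com");
        System.out.println(robots.isAllowed("/some/page.html"));
    }
}
```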

NoSuchMethodError in crawler4j CrawlController class

浪子不回头ぞ submitted on 2019-12-24 10:11:14
Question: I am using the example given here and included the necessary files (crawler4j-3.3.zip & crawler4j-3.x-dependencies.zip) from [here] (http://code.google.com/p/crawler4j/downloads/list) in my build path and run path. I am getting this error: Exception in thread "main" java.lang.NoSuchMethodError: com.sleepycat.je.EnvironmentConfig.setAllowCreate(Z)Lcom/sleepycat/je/EnvironmentConfig; at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:90) at edu.uci.ics.crawler4j.examples.basic

Use crawler4j to download js files

半世苍凉 submitted on 2019-12-24 08:58:40
Question: I'm trying to use crawler4j to download some websites. The only problem I have is that even though I return true for all .js files in the shouldVisit function, they never get downloaded. @Override public boolean shouldVisit(WebURL url) { return true; } @Override public void visit(Page page) { String url = page.getWebURL().getURL(); System.out.println("URL: " + url); } The URL for .js files never gets printed out. Answer 1: I noticed that "<script>" tags do not get processed by crawler4j. This was
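The answer is cut off, but one workaround consistent with its observation (sketched under the assumption that script URLs simply never reach shouldVisit) is to pull the `<script src>` URLs out of each visited page yourself with jsoup and download them directly; the output directory is a placeholder:

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class JsAwareCrawler extends WebCrawler {

    @Override
    public void visit(Page page) {
        if (!(page.getParseData() instanceof HtmlParseData)) {
            return;
        }
        String html = ((HtmlParseData) page.getParseData()).getHtml();
        String baseUrl = page.getWebURL().getURL();

        // If <script> tags are skipped by crawler4j's link extraction (as the answer
        // above suggests), collect the script URLs ourselves and fetch them directly.
        Document doc = Jsoup.parse(html, baseUrl);
        for (Element script : doc.select("script[src]")) {
            String jsUrl = script.absUrl("src");
            try (InputStream in = new URL(jsUrl).openStream()) {
                Path dir = Files.createDirectories(Paths.get("/tmp/js"));
                String fileName = jsUrl.substring(jsUrl.lastIndexOf('/') + 1);
                Files.copy(in, dir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
                System.out.println("Downloaded JS: " + jsUrl);
            } catch (Exception e) {
                System.err.println("Could not download " + jsUrl + ": " + e);
            }
        }
    }
}
```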

What sequence of steps does crawler4j follow to fetch data?

荒凉一梦 submitted on 2019-12-24 08:39:05
Question: I'd like to learn how crawler4j works. Does it fetch a web page, then download its content and extract it? What about the .db and .csv files and their structure? In general, what sequence of steps does it follow? I would appreciate a descriptive answer. Thanks. Answer 1: General crawler process. The process for a typical multi-threaded crawler is as follows: we have a queue data structure, called the frontier. Newly discovered URLs (or starting points, so-called seeds) are added to this data structure. In addition, for
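To make the frontier idea concrete, here is a minimal single-threaded sketch of that loop in plain Java, independent of crawler4j; fetchAndExtractLinks is a stand-in for whatever HTTP client and HTML parser you use:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class FrontierSketch {

    // The frontier: URLs we have discovered but not yet fetched.
    private final Queue<String> frontier = new ArrayDeque<>();
    // URLs we have already seen, so nothing is enqueued twice.
    private final Set<String> seen = new HashSet<>();

    public void crawl(List<String> seeds, int maxPages) {
        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.add(seed);
            }
        }
        int fetched = 0;
        while (!frontier.isEmpty() && fetched < maxPages) {
            String url = frontier.poll();
            // 1. download the page, 2. parse it, 3. return its outgoing links
            for (String link : fetchAndExtractLinks(url)) {
                if (seen.add(link)) {      // de-duplication step
                    frontier.add(link);    // newly discovered URL joins the frontier
                }
            }
            fetched++;
        }
    }

    // Placeholder for the real fetch/parse step (HTTP client + HTML parser).
    private List<String> fetchAndExtractLinks(String url) {
        return List.of();
    }
}
```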

Improving performance of crawler4j

▼魔方 西西 submitted on 2019-12-21 05:43:10
Question: I need to write a web scraper that scrapes around ~1M websites and saves their title, description and keywords into one big file (containing the scraped URL and the related words). The URLs are extracted from a big file. I ran crawler4j on the 1M-URL file and started the crawler using controller.start(MyCrawler.class, 20); 20 is an arbitrary number. Each crawler thread passes the resulting words into a blocking queue, and a single thread writes those words and the URL to the file. I've
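For reference, the single-writer pattern the question describes can look roughly like this (a sketch only; the Record type requires Java 16+, and the queue capacity and output path are placeholders):

```java
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ResultWriter implements Runnable {

    // A simple value holder for what each crawler thread produces.
    public record Record(String url, String title, String description, String keywords) {}

    // Poison pill that tells the writer thread to stop.
    public static final Record STOP = new Record("", "", "", "");

    private final BlockingQueue<Record> queue;

    public ResultWriter(BlockingQueue<Record> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get("results.tsv"), StandardCharsets.UTF_8)) {
            while (true) {
                Record r = queue.take();   // blocks until a crawler offers a record
                if (r == STOP) {
                    break;
                }
                out.write(r.url() + "\t" + r.title() + "\t" + r.description() + "\t" + r.keywords());
                out.newLine();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Crawler threads share one bounded queue so a slow disk back-pressures the crawl.
    public static BlockingQueue<Record> newQueue() {
        return new ArrayBlockingQueue<>(10_000);
    }
}
```

A crawler's visit() would then just call queue.put(new Record(...)), and the single writer thread drains the queue to disk.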

The design and principles of webmagic: how to develop a Java crawler

匆匆过客 submitted on 2019-12-19 17:53:44
This article is the design manual for webmagic 0.1.0; for the getting-started guide and user manual of later versions, see: https://github.com/code4craft/webmagic/blob/master/user-manual.md Some readers had left comments on my blog saying they found webmagic's implementation interesting and wanted to use it to study crawlers. I finally set aside the time and spent three days writing this article. I have written vertical crawlers for over a year and worked on the webmagic framework for more than a month, so I have some experience in this area that I hope readers will find useful.

The goals of webmagic

Generally speaking, a crawler consists of several parts:

- Page downloading. Downloading pages is the foundation of a crawler; only after a page has been downloaded can anything else happen.
- Link extraction. A crawler usually starts from a set of seed URLs, but those alone are far from enough; while crawling, it must keep discovering new links.
- URL management. At a minimum, URL management means telling already-crawled URLs apart from not-yet-crawled ones, so that no page is fetched twice.
- Content analysis and persistence. In most cases what we ultimately want is not the raw HTML page; the crawled pages have to be analysed, converted into structured data, and stored.

Different crawlers place different demands on these parts. A general-purpose crawler such as a search-engine spider needs to crawl most of the web indiscriminately, and there the difficulty lies in page downloading and link management-
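For orientation, here is roughly how those four parts fit together in webmagic's later, commonly documented PageProcessor API (which may differ in detail from the 0.1.0 version this article describes); the regex, XPath, and URL below are placeholders:

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class BlogPageProcessor implements PageProcessor {

    // Site holds the crawl-level settings (retries, politeness delay, ...).
    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Link extraction: discovered links go back into webmagic's URL management.
        page.addTargetRequests(page.getHtml().links().regex(".*code4craft.*").all());
        // Content analysis: extracted fields are handed to the pipeline for persistence.
        page.putField("title", page.getHtml().xpath("//title/text()").get());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        // Page downloading and scheduling are handled by the Spider itself;
        // by default, extracted fields are printed by a console pipeline.
        Spider.create(new BlogPageProcessor())
              .addUrl("https://github.com/code4craft/webmagic")
              .thread(5)
              .run();
    }
}
```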

Calling the controller (crawler4j-3.5) inside a loop

会有一股神秘感。 submitted on 2019-12-19 10:44:12
Question: Hi, I am calling the controller inside a for-loop because I have more than 100 URLs. I keep them all in a list, iterate over it, and crawl each page; I also pass each URL to setCustomData, because the crawler should not leave that domain. for (Iterator<String> iterator = ifList.listIterator(); iterator.hasNext();) { String str = iterator.next(); System.out.println("checking " + str); CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer); controller.setCustomData(str);
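As a point of comparison, a single CrawlController can take all of the URLs as seeds at once, which avoids rebuilding the crawl machinery on every loop iteration. A sketch under these assumptions: ifList is the question's own list, MyCrawler stands for your WebCrawler subclass, the storage folder is a placeholder, and domain filtering is done inside shouldVisit rather than via setCustomData:

```java
import java.util.List;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class MultiSeedLauncher {

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-data");

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // All 100+ URLs become seeds of one crawl instead of one controller per URL.
        List<String> ifList = List.of("https://example.com/", "https://example.org/");
        for (String url : ifList) {
            controller.addSeed(url);
        }

        // Restrict the crawl to the seed domains inside MyCrawler.shouldVisit()
        // (e.g. with a seed-domain set), instead of passing them via setCustomData.
        controller.start(MyCrawler.class, 10); // MyCrawler = your WebCrawler subclass
    }
}
```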