crawler4j

Don't know Python? Then crawl another way! A summary of Java web-crawling techniques

泪湿孤枕 submitted on 2020-04-27 18:17:02
—This blog post is original content; please credit the author when reposting— A few days ago a junior schoolmate of mine, who is about to graduate, needed data for her thesis research. A search on CNKI turned up well over a hundred thousand records! Her advisor told her to collect them by copying them manually, but copying that many records by hand would waste an enormous amount of time, so she asked me for help. I thought about it and figured that writing a crawler to pull the data down might be a good solution. I had always heard people say that Python is the best fit for crawlers, but I am a Java engineer! As Lu Xun (never) said: learning Python cannot save China, but Java can!

OK, just kidding. The main issue was that she needed the data urgently, and learning a whole new language just to write one crawler was not realistic, so I went with Java. A quick look on Zhihu showed that Java actually has plenty of open-source crawler APIs, so I got to work. Three days later the program was done and pulling data down. Below is a summary of the techniques I used; anyone interested is welcome to discuss.

Before sharing the techniques, a quick word on how a crawler works. "Web crawler" sounds impressive, but the principle is simple: the program sends a request to a given URL, the server returns the complete HTML, and the program then parses that HTML. Parsing boils down to locating HTML elements and extracting the data you want.

As for open-source Java crawler APIs, there are quite a few; see this link for a list: Recommended open-source Java crawler projects. Since I was not going to use this in a real production project, I chose crawler4j, which is very lightweight and easy to pick up. You can read its introduction on GitHub; here is a brief look at how to use it.
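To make that request-parse-extract flow concrete, here is a minimal crawler4j sketch (not the original post's code; it assumes the crawler4j 4.x API, where shouldVisit also receives the referring page, and the seed URL and storage folder are placeholders):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class BasicCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Only follow links that stay on the site we started from (placeholder domain).
        return url.getURL().startsWith("https://example.com/");
    }

    @Override
    public void visit(Page page) {
        // crawler4j has already downloaded and parsed the page; pull out what we need.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlData = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL() + " -> " + htmlData.getTitle());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-data"); // intermediate crawl state
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("https://example.com/");
        controller.start(BasicCrawler.class, 5); // 5 crawler threads
    }
}
```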

Web Crawling (Ajax/JavaScript enabled pages) using java

纵然是瞬间 submitted on 2020-01-09 09:37:08
Question: I am very new to web crawling. I am using crawler4j to crawl websites and collect the required information from them. My problem is that I was unable to crawl the content of the following site: http://www.sciencedirect.com/science/article/pii/S1568494612005741 . I want to extract the information shown in the attached screenshot, which contains three names (highlighted in red boxes).
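The excerpt stops before any answer. For context, crawler4j fetches the raw HTML and does not execute JavaScript, so content injected by AJAX never appears in the page it parses. One common workaround is to render such pages with a headless browser like HtmlUnit and then hand the resulting DOM to whatever parser you use; a minimal sketch assuming the HtmlUnit 2.x API, with an arbitrary wait time:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class AjaxPageFetcher {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Let HtmlUnit execute the page's JavaScript and tolerate script errors.
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage(
                    "http://www.sciencedirect.com/science/article/pii/S1568494612005741");
            // Give background AJAX calls a few seconds to finish (arbitrary timeout).
            webClient.waitForBackgroundJavaScript(5_000);
            // The DOM now includes script-injected content; pass it to any HTML parser.
            System.out.println(page.asXml());
        }
    }
}
```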

Restricting URLs to seed URL domain only crawler4j

倖福魔咒の submitted on 2020-01-03 02:55:52
Question: I want crawler4j to visit only pages that belong to the seed domains. There are multiple domains in the seed list. How can I do this? Suppose I add the seed URLs: www.google.com www.yahoo.com www.wikipedia.com Now I start the crawl, but I want my crawler to visit pages (via shouldVisit()) only within those three domains. There will obviously be external links, but the crawler should be restricted to these domains; sub-domains and sub-folders are fine, just nothing outside these domains.
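One way to do this, sketched below, is to keep the seed domains in a set and reject everything else in shouldVisit(). This assumes crawler4j 4.x (where shouldVisit receives the referring page) and relies on WebURL.getDomain() returning the registered domain without the sub-domain; the class name and domain list are placeholders:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class SeedDomainCrawler extends WebCrawler {

    // The registered domains of the seeds; sub-domains of these are still allowed.
    private static final Set<String> SEED_DOMAINS =
            new HashSet<>(Arrays.asList("google.com", "yahoo.com", "wikipedia.com"));

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // getDomain() strips the sub-domain, so sub-domains and sub-folders pass,
        // while links to any other domain are rejected.
        return SEED_DOMAINS.contains(url.getDomain().toLowerCase());
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }
}
```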

Parsing robots.txt using Java and identifying whether a URL is allowed

佐手、 submitted on 2019-12-30 07:05:27
Question: I am currently using jsoup in an application to parse and analyse web pages, but I want to make sure I adhere to the robots.txt rules and only visit pages that are allowed. I am fairly sure jsoup is not made for this; it is all about web scraping and parsing. So I planned to write a function/module that reads the robots.txt of the domain/site and determines whether the URL I am about to visit is allowed or not. I did some research and found the following, but I am not sure
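One lightweight option is a hand-rolled checker that only understands the `User-agent: *` group and plain `Disallow:` prefix rules (no Allow, no wildcards, no crawl-delay); crawler4j's built-in robots.txt handling or the crawler-commons library are more complete alternatives. The class below is a hypothetical helper written for illustration:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Minimal robots.txt checker: loads the file for one host and answers
// prefix-based Disallow rules from the "User-agent: *" group only.
public class SimpleRobotsTxt {

    private final List<String> disallowedPrefixes = new ArrayList<>();

    public SimpleRobotsTxt(String hostBase) throws Exception {
        URL robotsUrl = new URL(hostBase + "/robots.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(robotsUrl.openStream(), StandardCharsets.UTF_8))) {
            boolean inWildcardGroup = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    // Track whether we are inside the catch-all agent group.
                    inWildcardGroup = line.substring(11).trim().equals("*");
                } else if (inWildcardGroup && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) {
                        disallowedPrefixes.add(path);
                    }
                }
            }
        }
    }

    public boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        SimpleRobotsTxt robots = new SimpleRobotsTxt("https://example.com");
        System.out.println(robots.isAllowed("/some/page.html"));
    }
}
```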

NoSuchMethodError in crawler4j CrawlController class

浪子不回头ぞ submitted on 2019-12-24 10:11:14
Question: I am using the example given here and included the necessary files (crawler4j-3.3.zip & crawler4j-3.x-dependencies.zip) from [here] (http://code.google.com/p/crawler4j/downloads/list) in my build path and run path. I am getting this error: Exception in thread "main" java.lang.NoSuchMethodError: com.sleepycat.je.EnvironmentConfig.setAllowCreate(Z)Lcom/sleepycat/je/EnvironmentConfig; at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:90) at edu.uci.ics.crawler4j.examples.basic

Use crawler4j to download js files

半世苍凉 submitted on 2019-12-24 08:58:40
Question: I'm trying to use crawler4j to download some websites. The only problem I have is that even though I return true for all .js files in the shouldVisit function, they never get downloaded. @Override public boolean shouldVisit(WebURL url) { return true; } @Override public void visit(Page page) { String url = page.getWebURL().getURL(); System.out.println("URL: " + url); } The URL for .js files never gets printed out. Answer 1: I noticed that "<script>" tags do not get processed by crawler4j. This was
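The answer is cut off, but one workaround consistent with its observation (sketched under the assumption that script URLs simply never reach shouldVisit) is to pull the `<script src>` URLs out of each visited page yourself with jsoup and download them directly; the output directory is a placeholder:

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class JsAwareCrawler extends WebCrawler {

    @Override
    public void visit(Page page) {
        if (!(page.getParseData() instanceof HtmlParseData)) {
            return;
        }
        String html = ((HtmlParseData) page.getParseData()).getHtml();
        String baseUrl = page.getWebURL().getURL();

        // If <script> tags are skipped by crawler4j's link extraction (as the answer
        // above suggests), collect the script URLs ourselves and fetch them directly.
        Document doc = Jsoup.parse(html, baseUrl);
        for (Element script : doc.select("script[src]")) {
            String jsUrl = script.absUrl("src");
            try (InputStream in = new URL(jsUrl).openStream()) {
                Path dir = Files.createDirectories(Paths.get("/tmp/js"));
                String fileName = jsUrl.substring(jsUrl.lastIndexOf('/') + 1);
                Files.copy(in, dir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
                System.out.println("Downloaded JS: " + jsUrl);
            } catch (Exception e) {
                System.err.println("Could not download " + jsUrl + ": " + e);
            }
        }
    }
}
```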

What sequence of steps does crawler4j follow to fetch data?

荒凉一梦 submitted on 2019-12-24 08:39:05
Question: I'd like to learn how crawler4j works. Does it fetch a web page, then download its content and extract it? What about the .db and .csv files and their structure? In general, what sequence of steps does it follow? I would appreciate a descriptive answer. Thanks. Answer 1: General crawler process. The process for a typical multi-threaded crawler is as follows: we have a queue data structure, called the frontier. Newly discovered URLs (or starting points, so-called seeds) are added to this data structure. In addition, for
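To make the frontier idea concrete, here is a minimal single-threaded sketch of that loop in plain Java, independent of crawler4j; fetchAndExtractLinks is a stand-in for whatever HTTP client and HTML parser you use:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class FrontierSketch {

    // The frontier: URLs we have discovered but not yet fetched.
    private final Queue<String> frontier = new ArrayDeque<>();
    // URLs we have already seen, so nothing is enqueued twice.
    private final Set<String> seen = new HashSet<>();

    public void crawl(List<String> seeds, int maxPages) {
        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.add(seed);
            }
        }
        int fetched = 0;
        while (!frontier.isEmpty() && fetched < maxPages) {
            String url = frontier.poll();
            // 1. download the page, 2. parse it, 3. return its outgoing links
            for (String link : fetchAndExtractLinks(url)) {
                if (seen.add(link)) {      // de-duplication step
                    frontier.add(link);    // newly discovered URL joins the frontier
                }
            }
            fetched++;
        }
    }

    // Placeholder for the real fetch/parse step (HTTP client + HTML parser).
    private List<String> fetchAndExtractLinks(String url) {
        return List.of();
    }
}
```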

Improving performance of crawler4j

▼魔方 西西 submitted on 2019-12-21 05:43:10
Question: I need to write a web scraper that scrapes around ~1M websites and saves their title, description and keywords into one big file (containing the scraped URL and the related words). The URLs are extracted from a big file. I ran crawler4j on the 1M-URL file and started the crawler using controller.start(MyCrawler.class, 20); 20 is an arbitrary number. Each crawler thread passes the resulting words into a blocking queue, and a single thread writes those words and the URL to the file. I've
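For reference, the single-writer pattern the question describes can look roughly like this (a sketch only; the Record type requires Java 16+, and the queue capacity and output path are placeholders):

```java
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ResultWriter implements Runnable {

    // A simple value holder for what each crawler thread produces.
    public record Record(String url, String title, String description, String keywords) {}

    // Poison pill that tells the writer thread to stop.
    public static final Record STOP = new Record("", "", "", "");

    private final BlockingQueue<Record> queue;

    public ResultWriter(BlockingQueue<Record> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get("results.tsv"), StandardCharsets.UTF_8)) {
            while (true) {
                Record r = queue.take();   // blocks until a crawler offers a record
                if (r == STOP) {
                    break;
                }
                out.write(r.url() + "\t" + r.title() + "\t" + r.description() + "\t" + r.keywords());
                out.newLine();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Crawler threads share one bounded queue so a slow disk back-pressures the crawl.
    public static BlockingQueue<Record> newQueue() {
        return new ArrayBlockingQueue<>(10_000);
    }
}
```

A crawler's visit() would then just call queue.put(new Record(...)), and the single writer thread drains the queue to disk.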

The design and principles of webmagic: how to develop a Java crawler

匆匆过客 submitted on 2019-12-19 17:53:44
This article is the design manual for webmagic 0.1.0; for the getting-started guide and user manual of later versions, see: https://github.com/code4craft/webmagic/blob/master/user-manual.md Some readers had left comments on my blog saying they found webmagic's implementation interesting and wanted to use it to study crawlers. I finally set aside the time and spent three days writing this article. I have written vertical crawlers for over a year and worked on the webmagic framework for more than a month, so I have some experience in this area that I hope readers will find useful.

The goals of webmagic

Generally speaking, a crawler consists of several parts:

- Page downloading. Downloading pages is the foundation of a crawler; only after a page has been downloaded can anything else happen.
- Link extraction. A crawler usually starts from a set of seed URLs, but those alone are far from enough; while crawling, it must keep discovering new links.
- URL management. At a minimum, URL management means telling already-crawled URLs apart from not-yet-crawled ones, so that no page is fetched twice.
- Content analysis and persistence. In most cases what we ultimately want is not the raw HTML page; the crawled pages have to be analysed, converted into structured data, and stored.

Different crawlers place different demands on these parts. A general-purpose crawler such as a search-engine spider needs to crawl most of the web indiscriminately, and there the difficulty lies in page downloading and link management-
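For orientation, here is roughly how those four parts fit together in webmagic's later, commonly documented PageProcessor API (which may differ in detail from the 0.1.0 version this article describes); the regex, XPath, and URL below are placeholders:

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class BlogPageProcessor implements PageProcessor {

    // Site holds the crawl-level settings (retries, politeness delay, ...).
    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Link extraction: discovered links go back into webmagic's URL management.
        page.addTargetRequests(page.getHtml().links().regex(".*code4craft.*").all());
        // Content analysis: extracted fields are handed to the pipeline for persistence.
        page.putField("title", page.getHtml().xpath("//title/text()").get());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        // Page downloading and scheduling are handled by the Spider itself;
        // by default, extracted fields are printed by a console pipeline.
        Spider.create(new BlogPageProcessor())
              .addUrl("https://github.com/code4craft/webmagic")
              .thread(5)
              .run();
    }
}
```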

Calling the controller (crawler4j-3.5) inside a loop

会有一股神秘感。 submitted on 2019-12-19 10:44:12
Question: Hi, I am calling the controller inside a for-loop because I have more than 100 URLs. I keep them all in a list, iterate over it, and crawl each page; I also pass each URL to setCustomData, because the crawler should not leave that domain. for (Iterator<String> iterator = ifList.listIterator(); iterator.hasNext();) { String str = iterator.next(); System.out.println("checking " + str); CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer); controller.setCustomData(str);
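As a point of comparison, a single CrawlController can take all of the URLs as seeds at once, which avoids rebuilding the crawl machinery on every loop iteration. A sketch under these assumptions: ifList is the question's own list, MyCrawler stands for your WebCrawler subclass, the storage folder is a placeholder, and domain filtering is done inside shouldVisit rather than via setCustomData:

```java
import java.util.List;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class MultiSeedLauncher {

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-data");

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // All 100+ URLs become seeds of one crawl instead of one controller per URL.
        List<String> ifList = List.of("https://example.com/", "https://example.org/");
        for (String url : ifList) {
            controller.addSeed(url);
        }

        // Restrict the crawl to the seed domains inside MyCrawler.shouldVisit()
        // (e.g. with a seed-domain set), instead of passing them via setCustomData.
        controller.start(MyCrawler.class, 10); // MyCrawler = your WebCrawler subclass
    }
}
```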