nutch

Configuring Nutch 1.7 in Eclipse

Posted by 谁都会走 on 2019-11-28 11:55:33
I ran into many problems integrating the Nutch project into Eclipse. The first time, I got it configured by following material found online, but I didn't take notes. The next day, when I sat down to write them up, I had already forgotten the steps and lost another whole day redoing it. All in all, far too much time went into setting up this environment. Getting to the point:

1. Deploying the Nutch project into Eclipse. First, go to the Nutch FAQ at http://wiki.apache.org/nutch/FAQ and open the second link listed there. Follow that document; wherever you get stuck, search online for the specific problem. For the integration itself, http://blog.csdn.net/witsmakemen/article/details/8866235 is also a useful reference.

Before running, the following prerequisites must be in place:
A. Install and configure Apache Ant on Windows: http://ant.apache.org/manual/index.html
B. Install Eclipse (obviously required).
C. Install svn on Linux, in order to check out the Nutch 1.7 source code.
D. Check out the Nutch 1.7 code on Linux:
[root@nutch-five branch-1.7]# svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.7/
E. Install the ivy plugin on Linux, so that jar dependencies are downloaded automatically according to the ivy configuration files.
F. Build branch-1.7.
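Put together, the checkout-and-build steps above (D through F) can be sketched as the following shell session. This is only an illustration: the `ant runtime` target is what recent Nutch 1.x branches use to produce a runnable tree, but you should verify the target name against the `build.xml` in your checkout.

```shell
# Step D: check out the Nutch 1.7 branch.
svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.7/
cd branch-1.7

# Steps E/F: build. On the first run, Ant invokes ivy to
# download the jars declared in ivy/ivy.xml.
ant runtime
```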

Apache Nutch: FetcherJob throws NoSuchElementException deep in Gora

Posted by 一个人想着一个人 on 2019-11-28 08:34:26
Question: I'm running Apache Nutch 2.3.1 out of the box, which uses Gora 0.6.1. I've followed the instructions here: http://wiki.apache.org/nutch/RunNutchInEclipse and the InjectorJob ran fine. Now I'm running the FetcherJob, with Gora using MemStore as the data store; my gora.properties contains:

gora.datastore.default=org.apache.gora.memory.store.MemStore

This throws:

2016-10-02 22:55:54,605 ERROR mapreduce.GoraRecordReader (GoraRecordReader.java:nextKeyValue(121)) - Error reading Gora records

Nutch regex-urlfilter syntax

Posted by 折月煮酒 on 2019-11-28 08:30:29
Question: I am running Nutch v1.6 and it crawls specific sites correctly, but I can't seem to get the syntax right in the file NUTCH_ROOT/conf/regex-urlfilter.txt. The site I want to crawl has a URL similar to this: http://www.example.com/foo.cfm On that page there are numerous links that match the following pattern: http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976 I want to crawl links that match the second example above as well. In my regex-urlfilter.txt I have the following: +
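Since the rule itself is cut off above, one way to sanity-check a candidate pattern before adding it as a "+" line in regex-urlfilter.txt is to try it against both example URLs in plain Java. This is only an illustration of the regex, not of Nutch's filter plugin itself; the assumption that Nutch applies each rule as a partial (find-style) match should be verified against your Nutch version.

```java
import java.util.regex.Pattern;

// Hypothetical check: a prefix pattern that matches both the page URL
// and the longer links under it (the same regex would go after "+" in
// regex-urlfilter.txt).
public class UrlFilterCheck {
    public static void main(String[] args) {
        Pattern rule = Pattern.compile("^https?://www\\.example\\.com/foo\\.cfm");
        String[] urls = {
            "http://www.example.com/foo.cfm",
            "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976"
        };
        for (String url : urls) {
            // find() succeeds for both, since the prefix anchors at "^"
            // but does not require the URL to end there.
            System.out.println(url + " -> " + rule.matcher(url).find());
        }
    }
}
```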

How do I save the original html file with Apache Nutch

Posted by ﹥>﹥吖頭↗ on 2019-11-28 00:22:52
I'm new to search engines and web crawlers. I want to store all the original pages of a particular web site as html files, but with Apache Nutch I can only get the binary database files. How do I get the original html files with Nutch? Does Nutch support this? If not, what other tools can I use to achieve my goal? (Tools that support distributed crawling are preferred.) Well, Nutch writes the crawled data in binary form, so if you want it saved in html format, you will have to modify the code (this will be painful if you are new to Nutch). If you want a quick and easy solution

How to get the html content from nutch

Posted by 萝らか妹 on 2019-11-27 21:39:55
Is there any way to get the html content of each webpage in Nutch while crawling? Yes, you can actually export the content of the crawled segments. It is not straightforward, but it works well for me. First, create a java project with the following code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;
import java.io.File;
import java.io
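Because the code above is cut off, here is a self-contained sketch of the same idea, assuming the classic Nutch 1.x segment layout: segments/<timestamp>/content/part-00000/data is a Hadoop SequenceFile of <Text url, Content page> pairs. The class name, output directory, and argument handling are illustrative, and the Hadoop and Nutch jars must be on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;
import java.io.File;
import java.io.FileOutputStream;

public class SegmentContentDumper {
    public static void main(String[] args) throws Exception {
        // args[0]: path to a segment's content data file, e.g.
        // crawl/segments/20191128.../content/part-00000/data
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(args[0]), conf);
        Text url = new Text();
        Content page = new Content();
        int i = 0;
        while (reader.next(url, page)) {
            // Write each page's raw fetched bytes (usually html) to its own file.
            File out = new File("dump", "page-" + (i++) + ".html");
            out.getParentFile().mkdirs();
            try (FileOutputStream fos = new FileOutputStream(out)) {
                fos.write(page.getContent());
            }
        }
        reader.close();
    }
}
```

Run it once per segment; each record's URL is in the key if you would rather name the files after the pages themselves.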

Configuring Nutch to Simulate a Browser and Bypass Anti-Crawler Restrictions

Posted by [亡魂溺海] on 2019-11-27 09:18:55
When we configured Nutch to crawl http://yangshangchuan.iteye.com, every page it fetched contained only: 您的访问请求被拒绝 ("Your access request has been denied") ...... This is the simplest anti-crawler strategy (it simply reads the User-Agent header of the HTTP request to decide whether the client is a human using a browser or a machine crawler), so we only need to configure Nutch to simulate a web browser to get past this restriction. In nutch-default.xml there are 5 settings related to the User-Agent:

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>
<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will appear
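A minimal override of these agent properties in nutch-site.xml might look like the following sketch. The Firefox User-Agent string is just an example value, not a recommendation, and the exact set of http.agent.* property names should be checked against the nutch-default.xml shipped with your Nutch version.

```xml
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0</value>
</property>
<property>
  <name>http.agent.version</name>
  <value></value>
</property>
<property>
  <name>http.agent.description</name>
  <value></value>
</property>
<property>
  <name>http.agent.url</name>
  <value></value>
</property>
<property>
  <name>http.agent.email</name>
  <value></value>
</property>
```

Leaving version, description, url, and email empty keeps the header to just the agent name, which is what a plain browser User-Agent check sees.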
