nutch

Nutch not crawling URLs except the one specified in seed.txt

做~自己de王妃 Submitted on 2019-12-01 13:33:06
I am using Apache Nutch 1.12, and the URL I am trying to crawl is something like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don't want to crawl any page that doesn't have "abc-def" in the URL, I have put the following line in regex-urlfilter.txt: +^https://www.mywebsite.com/abc-def/(.+)*$ When I run the following crawl command: /bin/crawl -i -D solr.server.url=http://mysolr:3737/solr/coreName $NUTCH_HOME/urls/ $NUTCH_HOME/crawl 3 it crawls and indexes just the one seed.txt URL, and in the second iteration it just says: Generator: starting at 2017
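For illustration only, a minimal regex-urlfilter.txt sketch along the lines of the question (the host and path are the question's placeholders; whether this behaves as expected also depends on the other rules already in the file, since Nutch applies the first rule that matches a URL):

# accept only pages under the abc-def section; dots escaped, no trailing wildcard needed
+^https://www\.mywebsite\.com/abc-def/
# reject everything else
-.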

Optimizing Nutch crawl efficiency when running on Hadoop

本小妞迷上赌 Submitted on 2019-11-30 21:49:51
The following are potential factors that affect crawl efficiency (translated from the official material):
1) DNS setup
2) The number of crawlers, too many or too few
3) Bandwidth limits
4) The number of threads per host
5) Uneven distribution of the URLs to be fetched
6) High crawl delays in robots.txt (usually seen together with uneven URL distribution)
7) Many slow pages (usually seen together with uneven distribution)
8) Too much content to download (PDFs, large HTML pages; usually seen together with uneven distribution)
9) Others
So how can these be improved?
1) Set up DNS on each local crawl machine. If there are several crawl machines and a single central DNS server, the load looks like a DoS attack on that DNS server and slows the whole system down. We often set up two tiers: hit a local DNS cache first, then a larger DNS cache such as OpenDNS or Verizon.
2) This is the number of map tasks multiplied by the value of the fetcher.threads.fetch property, so 10 map tasks * 20 threads = 200 fetch lists at a time. Too many will overload your system; too few will leave some machines idle. Think carefully about how to set these properties for your environment.
3) Bandwidth limits: use ntop, ganglia, or other monitoring software to measure how much bandwidth you are using, both inbound and outbound. A simple test: from a server on the crawl network that is not used as a crawler, connect to one of the crawl machines, or download data from it while it is fetching; if that is very slow, you can increase the bandwidth
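To make point 2 concrete, a nutch-site.xml fragment overriding the thread settings might look like the sketch below. fetcher.threads.fetch and fetcher.threads.per.queue are standard Nutch properties, but the values are examples only and should be tuned to your own hardware and politeness requirements:

<configuration>
  <!-- total fetcher threads per map task (example value) -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>20</value>
  </property>
  <!-- threads allowed to hit a single host/queue at the same time -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
  </property>
</configuration>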

Latest compatible versions of Nutch and Solr

你。 Submitted on 2019-11-30 20:46:36
I see different combinations of Nutch and Solr versions being used by people posting about this subject on the web. Which are the latest stable (non-beta) and compatible versions of Nutch and Solr that I can download and set up without building sources, just by configuring? You can use Nutch 2.1 or Nutch 1.6. If you want to use HBase, you have to use Nutch 2.x, because Nutch 1.6 does not support HBase. I use Nutch 2.1, HBase 0.90.x or 0.94.5, and Solr 4.3.0. There are major changes between the two Solr versions (Solr 3.x and Solr 4.x), so you must choose one of them according to your requirements. Ex:

Nutch API advice

不羁的心 Submitted on 2019-11-30 20:33:53
I'm working on a project where I need a mature crawler to do some work, and I'm evaluating Nutch for this purpose. My current needs are relatively straightforward: I need a crawler that is able to save the data to disk, and I need it to be able to recrawl only the updated resources of a site and skip the parts that are already crawled. Does anyone have any experience working with the Nutch code directly in Java, not via the command line? I would like to start simple: create a crawler (or similar), minimally configure it, and start it, nothing fancy. Is there some example for this, or some
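No example is given here, but as a rough, hedged sketch of what driving Nutch 1.x from Java rather than from the command line can look like: the individual tools (Injector, Generator, Fetcher, and so on) implement Hadoop's Tool interface, so they can be invoked through ToolRunner with the same arguments the shell scripts pass. Class names, packages, and argument order vary between Nutch versions, and the segment path below is a hypothetical placeholder, so treat this purely as an outline to check against the version you actually use:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.util.NutchConfiguration;

public class SimpleCrawl {
  public static void main(String[] args) throws Exception {
    // Reads nutch-default.xml / nutch-site.xml from the classpath
    Configuration conf = NutchConfiguration.create();

    // Inject seed URLs into the crawldb (same arguments as "bin/nutch inject")
    ToolRunner.run(conf, new Injector(), new String[] {"crawl/crawldb", "urls"});

    // Generate a fetch list, then fetch it; in practice you would read back the
    // segment directory the Generator actually created instead of hard-coding it
    ToolRunner.run(conf, new Generator(),
        new String[] {"crawl/crawldb", "crawl/segments", "-topN", "50"});
    ToolRunner.run(conf, new Fetcher(),
        new String[] {"crawl/segments/20190101000000", "-threads", "10"});
  }
}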

Recrawl URL with Nutch just for updated sites

让人想犯罪 __ Submitted on 2019-11-30 13:43:25
I crawled one URL with Nutch 2.1, and now I want to re-crawl pages after they get updated. How can I do this? How can I know that a page is updated? İsmet Alkan: Simply put, you can't. You need to recrawl the page to check whether it has been updated. So, according to your needs, prioritize the pages/domains and recrawl them within a time period. For that you need a job scheduler such as Quartz. You also need to write a function that compares the pages. However, Nutch originally saves the pages as index files; in other words, Nutch generates new binary files to store the HTML. I don't think it's possible to compare
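A hedged addition to the answer above: Nutch itself can also lengthen or shorten the time before a page is refetched depending on whether its content changed, via the adaptive fetch schedule. The property names below are the standard ones in Nutch configuration, but verify them against the version you run before relying on this sketch:

<configuration>
  <!-- use the adaptive schedule instead of a fixed refetch interval -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <!-- baseline refetch interval in seconds (example: 7 days) -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>604800</value>
  </property>
</configuration>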

Nutch 2.1 + HBase + Solr: quickly building a crawler and search engine (fast, basically done within 2 hours)

五迷三道 Submitted on 2019-11-30 13:02:07
Note: this approach is meant for a quick trial or for small data volumes; it is not suitable for a production environment with large amounts of data.
Environment: CentOS 7, Nutch 2.2.1, Java 1.8, Ant 1.9.14, HBase 0.90.4 (standalone), Solr 7.7
Download link: https://pan.baidu.com/s/1Tut2CcKoJ9-G-HBq8zexMQ extraction code: v75v
Installation. The JDK and Ant are assumed to be installed already (essentially just unpack them and set the environment variables; look it up online if you are unsure how).
Install standalone HBase. Download and unpack:
wget http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz
tar zxf hbase-0.90.4.tar.gz # or simply use the package provided above
Configure:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>/data/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/data/zookeeper</value>
  </property>
</configuration>
Note: HBase
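The post is cut off here; one step a Nutch 2.x + HBase setup typically needs around this point is telling Nutch to store crawl data in HBase through Apache Gora. A hedged sketch of that nutch-site.xml fragment follows (the property and class name are the usual ones for Nutch 2.x, but check them against your version, and remember to enable the gora-hbase dependency in ivy/ivy.xml as well):

<configuration>
  <!-- store crawl data in HBase via Apache Gora -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
</configuration>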

Installing Nutch 2 + HBase + Solr 4

﹥>﹥吖頭↗ Submitted on 2019-11-30 11:09:21
Introduction: Nutch is a web crawler, HBase a distributed storage system, and Solr a search server.
Version notes: because Nutch, HBase, and Solr each change on their own release lines, there are different ways to configure the three together; the configuration here follows online documentation for their recent versions. The Nutch version is 2.2.1 (apache-nutch-2.2.1-src.tar.gz); since this version of Nutch defaults to HBase 0.90.4, hbase-0.90.4.tar.gz is used here. Nutch's default Solr version is 3.4.0, but because Solr 4 differs considerably from Solr 3, I use the then-latest 4.x release, solr-4.4.0.tgz.
Packages: apache-nutch-2.2.1-src.tar.gz, hbase-0.90.4.tar.gz, solr-4.4.0.tgz
Downloads: Nutch home page http://nutch.apache.org/ , download apache-nutch-2.2.1-src.tar.gz; HBase home page http://hbase.apache.org/ , download hbase-0.90.4.tar.gz; Solr home page http://lucene.apache.org/solr/ , download solr-4.4.0.tgz
Install the JDK: see "Installing the JDK on Linux".
Install HBase: standalone HBase

Common errors when deploying Nutch to Eclipse

好久不见. Submitted on 2019-11-30 11:09:07
Common errors when deploying Nutch to Eclipse:
Failed to set permissions of path: \tmp\hadoop-hadoop\mapred\staging\hadoop1847455384\.staging to 0700
I seem to have hit this problem before when deploying Hadoop in Eclipse, but by now I no longer remember how I solved it. So, notes matter!!!
Method 1: comment out
<target name="create-native-configure">
  <exec executable="autoreconf" dir="${native.src.dir}" searchpath="yes" failonerror="yes">
    <arg value="-if"/>
  </exec>
</target>
and remove the create-native-configure dependency from
<target name="compile-core-native" depends="create-native-configure, compile-core-classes" if="compile.native">
3. Modify hadoop-1.1.2/src/core/org/apache/hadoop/fs/FileUtil.java
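The post is truncated before showing the change to FileUtil.java, but the widely circulated workaround for this error edits checkReturnValue so that a failed chmod on Windows is only logged instead of thrown. A hedged sketch of what that edit usually looks like (the method body and log call are reconstructed from memory of the Hadoop 1.x sources; compare against your own hadoop-1.1.2 tree before applying):

// hadoop-1.1.2/src/core/org/apache/hadoop/fs/FileUtil.java
private static void checkReturnValue(boolean rv, File p, FsPermission permission)
    throws IOException {
  // The original code throws an IOException here when rv is false, which aborts
  // local job staging on Windows; the common workaround logs a warning and continues.
  if (!rv) {
    LOG.warn("Failed to set permissions of path: " + p + " to "
        + String.format("%04o", permission.toShort()));
  }
}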

Nutch No agents listed in 'http.agent.name'

烈酒焚心 Submitted on 2019-11-30 05:08:26
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
    at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1068)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:135)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect
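The usual fix for this exception is to give the crawler a user-agent name by setting http.agent.name, which Nutch requires to be non-empty before it will fetch. A minimal conf/nutch-site.xml sketch (the agent string itself is a placeholder; choose one that identifies your crawler):

<configuration>
  <!-- Nutch refuses to fetch until this is set to a non-empty agent name -->
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>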
