
Nutch in Action

Submitted by 安稳与你 on 2019-12-06 06:43:32
Basic information

Nutch is an open-source Java search engine package that provides all the tools and functionality needed to build a search engine. With Nutch you can build a search engine for your own intranet, or one that targets the entire web. Beyond the basics, Nutch has a number of distinctive features of its own, such as Map-Reduce, Hadoop, and its plugin system.

The overall structure of Nutch

Viewed as a whole, Nutch breaks down into three main parts: crawling, indexing, and searching; Figure 1 shows how they relate. The Web DB is the URL collection Nutch starts from; the Fetcher is the component that downloads pages, i.e. what is usually called the crawler; the Indexer is the part that builds the index, generating index files and storing them in the system; the Searcher is the query component, which runs a search for a given term and returns the results.

Figure 1. Overall structure of Nutch

How Nutch runs

With the overall structure in mind, let's look in detail at how Nutch actually runs. The flow is shown in Figure 2.

1. Inject the seed URL collection into the Nutch system.
2. Generate segment files containing the URLs to be fetched.
3. Fetch the corresponding content from the web for those URLs.
4. Parse the fetched pages and extract their text and data.
5. Update the seed URL collection with the URLs found in the newly fetched pages, and fetch again.
6. At the same time
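The five numbered steps above map directly onto the Nutch 1.x command-line tools. A minimal sketch of one crawl cycle, assuming a seed directory named urls and a crawl directory named crawl (both placeholder paths):

    bin/nutch inject crawl/crawldb urls                        # 1. inject seed URLs
    bin/nutch generate crawl/crawldb crawl/segments -topN 50   # 2. generate a segment of URLs to fetch
    s=`ls -d crawl/segments/2* | tail -1`                      # pick the newest segment
    bin/nutch fetch $s                                         # 3. fetch page content
    bin/nutch parse $s                                         # 4. parse text and outlinks
    bin/nutch updatedb crawl/crawldb $s                        # 5. feed new outlinks back into the URL set

Repeating generate/fetch/parse/updatedb deepens the crawl by one hop per iteration.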

connection refused error when running Nutch 2

Submitted by 一个人想着一个人 on 2019-12-06 01:14:16
Question: I am trying to run the Nutch 2 crawler on my system, but I get the following error:

    Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException: java.sql.SQLTransientConnectionException: java.net.ConnectException: Connection refused
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:69
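With the Gora SQL backend that early Nutch 2.x defaulted to, "Connection refused" usually means the database that conf/gora.properties points at is not running. A minimal sketch of a gora.properties for a local HSQLDB server; the URL, user, and database name are assumptions to adapt to your setup:

    # conf/gora.properties (hypothetical local HSQLDB setup)
    gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
    gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
    gora.sqlstore.jdbc.user=sa
    gora.sqlstore.jdbc.password=

The HSQLDB server itself must be started before the crawl, e.g. java -cp hsqldb.jar org.hsqldb.server.Server --database.0 file:nutchtest --dbname.0 nutchtest.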

How to parse content located in specific HTML tags using nutch plugin?

Submitted by 大憨熊 on 2019-12-06 01:11:32
Question: I am using Nutch to crawl websites, and I want to parse specific sections of the HTML pages crawled by Nutch. For example:

    <h><title> title to search </title></h>
    <div id="abc"> content to search </div>
    <div class="efg"> other content to search </div>

I want to parse the div element with id="abc", the one with class="efg", and so on. I know that I have to create a plugin for customized parsing, since the htmlparser plugin provided by Nutch removes all HTML tags, CSS, and JavaScript content and leaves only the text content.
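One common route is a custom HtmlParseFilter that re-parses the raw page and stashes the selected text in the parse metadata, where an indexing filter can pick it up later. A minimal sketch for Nutch 1.x, assuming jsoup is on the plugin's classpath; the class name, metadata keys, and selectors are illustrative, not part of any Nutch API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.metadata.Metadata;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.w3c.dom.DocumentFragment;

    public class DivExtractorFilter implements HtmlParseFilter {
      private Configuration conf;

      @Override
      public ParseResult filter(Content content, ParseResult parseResult,
                                HTMLMetaTags metaTags, DocumentFragment doc) {
        // Re-parse the raw bytes with jsoup so CSS selectors can be used.
        // (Charset detection is omitted here for brevity.)
        Document html = Jsoup.parse(new String(content.getContent()));
        String byId = html.select("div#abc").text();
        String byClass = html.select("div.efg").text();

        // Attach the extracted snippets to the parse metadata so an
        // indexing filter can later turn them into index fields.
        Parse parse = parseResult.get(content.getUrl());
        Metadata meta = parse.getData().getParseMeta();
        meta.add("content_abc", byId);
        meta.add("content_efg", byClass);
        return parseResult;
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }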

Find all the web pages in a domain and its subdomains

Submitted by 三世轮回 on 2019-12-05 21:37:48
I am looking for a way to find all the web pages and subdomains in a domain. For example, in the uoregon.edu domain, I would like to find all the web pages in this domain and in all of its subdomains (e.g., cs.uoregon.edu). I have been looking at Nutch, and I think it can do the job. But it seems that Nutch downloads entire web pages and indexes them for later search, whereas I want a crawler that only scans a web page for URLs that belong to the same domain. Furthermore, it seems that Nutch saves the linkdb in a serialized format. How can I read it? I tried Solr, and it can read Nutch's
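For the serialized linkdb, Nutch ships readers that dump it to plain text. A sketch, assuming the crawl output lives under a directory named crawl:

    bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump   # inlinks per URL, as text
    bin/nutch readdb crawl/crawldb -dump crawldb_dump     # per-URL crawl status, as text

The dump files can then be filtered for URLs under uoregon.edu without indexing any page content.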

How to Crawl .pdf links using Apache Nutch

Submitted by 柔情痞子 on 2019-12-05 13:47:41
I got a website to crawl which includes some links to PDF files. I want Nutch to crawl those links and dump them as .pdf files. I am using Apache Nutch 1.6, and I am trying this in Java as:

    ToolRunner.run(NutchConfiguration.create(), new Crawl(), tokenize(crawlArg));
    SegmentReader.main(tokenize(dumpArg));

Can someone help me with this?

Answer (nimeshjm): If you want Nutch to crawl and index your PDF documents, you have to enable document crawling and the Tika plugin:

Document crawling
1.1 Edit regex-urlfilter.txt and remove any occurrence of "pdf":

    # skip image and other suffixes we can't yet parse
    # for a more
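The Tika half of that answer amounts to adding parse-tika to the plugin.includes property in conf/nutch-site.xml. A minimal sketch; the surrounding plugin list here is the stock default and may differ in your installation:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

With parse-tika enabled and the pdf suffix no longer filtered out, Nutch will fetch and parse the PDF links it encounters.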

Can't get apache nutch to crawl - permissions and JAVA_HOME suspected

Submitted by 可紊 on 2019-12-05 08:11:13
Question: I am trying to run a basic crawl as per the NutchTutorial:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 5

I have Nutch installed and set up with Solr. I set my $JAVA_HOME in my .bashrc to /usr/lib/jvm/java-1.6.0-openjdk-amd64. I don't see any problems when I run bin/nutch from the Nutch home directory, but when I try to run the crawl as above I get the following error:

    log4j:ERROR setFile(null,true) call failed.
    java.io.FileNotFoundException: /usr/share/nutch/logs/hadoop.log
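That FileNotFoundException is typically a permissions problem rather than a JAVA_HOME one: the user running the crawl cannot write to /usr/share/nutch/logs. A sketch of two common fixes, with paths assumed from the error above:

    # either make the default log directory writable...
    sudo chown -R $(whoami) /usr/share/nutch/logs
    # ...or point Nutch at a log directory you own
    export NUTCH_LOG_DIR=$HOME/nutch/logs
    mkdir -p "$NUTCH_LOG_DIR"
    bin/nutch crawl urls -dir crawl -depth 3 -topN 5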

Apache Nutch to index only part of page content

Submitted by 微笑、不失礼 on 2019-12-05 07:18:24
Question: I am going to use Apache Nutch v1.3 to extract only some specific content from web pages. I checked the parse-html plugin; it seems to normalize each HTML page using TagSoup or NekoHTML, which is good. I need to extract only the text inside the <span class='xxx'> and <span class='yyy'> elements on the page. It would be great if the extracted texts were saved into different fields (e.g., content_xxx, content_yyy). My question is: should I write my own plugin, or can this be done in some standard way? The best
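A custom HtmlParseFilter, as in the div-extraction question above, is the usual route. Beyond the Java class itself, the plugin also needs a plugin.xml registering it at the HtmlParseFilter extension point. A sketch with made-up plugin ids and class names:

    <plugin id="parse-spans" name="Span Text Extractor"
            version="1.0.0" provider-name="example.org">
      <runtime>
        <library name="parse-spans.jar">
          <export name="*"/>
        </library>
      </runtime>
      <requires>
        <import plugin="nutch-extensionpoints"/>
      </requires>
      <extension id="org.example.parse.spans" name="Span Parse Filter"
                 point="org.apache.nutch.parse.HtmlParseFilter">
        <implementation id="SpanExtractorFilter"
                        class="org.example.parse.SpanExtractorFilter"/>
      </extension>
    </plugin>

The plugin id then has to be added to plugin.includes in nutch-site.xml, and an indexing filter can map the content_xxx / content_yyy metadata keys to separate index fields.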

Re-crawling websites fast

Submitted by 北战南征 on 2019-12-05 07:04:01
Question: I am developing a system that has to track the content of a few portals and check for changes every night (for example, download and index new pages that have been added during the day). The content of these portals will be indexed for searching. The problem is re-crawling these portals: the first crawl of a portal takes very long (examples of portals: www.onet.pl, www.bankier.pl, www.gazeta.pl), and I want to re-crawl them faster (as fast as possible), for example by checking the date of modification, but I
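Nutch's built-in answer to this is the adaptive fetch schedule: pages that change often get re-fetched sooner, unchanged pages progressively less often. A sketch of the relevant conf/nutch-site.xml properties; the interval values (in seconds) are examples, not recommendations:

    <property>
      <name>db.fetch.schedule.class</name>
      <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    </property>
    <property>
      <name>db.fetch.interval.default</name>
      <value>86400</value> <!-- start by re-fetching daily -->
    </property>
    <property>
      <name>db.fetch.schedule.adaptive.min_interval</name>
      <value>3600</value> <!-- never re-fetch more often than hourly -->
    </property>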

How to increase number of documents fetched by Apache Nutch crawler

Submitted by 巧了我就是萌 on 2019-12-05 06:22:04
Question: I am using Apache Nutch 2.3 for crawling. There were about 200 URLs in the seed at the start. Now, as time has elapsed, the number of documents crawled is decreasing, or at most staying the same as at the start. How can I configure Nutch so that the number of documents crawled increases? Is there any parameter that can be used to control the number of documents? Second, how can I count the number of documents crawled per day by Nutch?

Answer 1: One crawl cycle consists of four steps: Generate, Fetch, Parse, and Update DB. For
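For both parts of the question, the Nutch 2.x command line offers a sketch of a starting point; the flag names below follow the 2.x bin/nutch script and should be verified against your version:

    # raise how many of the top-scoring URLs enter each generate/fetch cycle
    bin/nutch generate -topN 5000
    # per-status counts (fetched, unfetched, gone, ...) for the web table;
    # run daily and diff the "fetched" count to estimate documents per day
    bin/nutch readdb -stats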

Nutch + Solr + mmseg4j Integration

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-05 04:40:36
Chapter 1: Installing and configuring Solr 4.2

    # Download Solr 4.2.0
    [root@nutch nutch2]# wget http://archive.apache.org/dist/lucene/solr/4.2.0/solr-4.2.0.tgz
    # Unpack the Solr 4.2.0 archive
    [root@nutch nutch2]# tar -xzvf solr-4.2.0.tgz

Copy nutch/conf/schema.xml to solr/collection1/conf. With Solr 4.2.0, we need to copy Nutch's schema-solr4.xml file into the conf directory under collection1, renaming it to schema.xml:

    [root@nutch nutch2]# cp /home/nutch2/release-1.6/runtime/local/conf/schema-solr4.xml /home/nutch2/solr-4.2.0/example/solr/collection1/conf/schema.xml

    # Start the Solr server
    [root@nutch example]# java -jar start.jar &

After startup, the following error is reported: _version_ does not exist Unable to use updateLog:
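This "Unable to use updateLog" error means the schema.xml copied from Nutch lacks the _version_ field that Solr 4's transaction log requires. A minimal sketch of the fix: add the field to the copied schema.xml and restart Solr (this assumes a field type named "long" is defined in the schema, as it is in the stock Solr 4 examples):

    <field name="_version_" type="long" indexed="true" stored="true"/>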