Notes on Integrating Nutch with Qidian R3 (Part 1)

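Baidu and Google help us find information on the Internet, but they cannot help with an industry intranet. Yet within an industry, most information lives on the intranet, stored as web pages, Office documents, images, video, audio, and other formats. To retrieve that internal content conveniently, quickly, and securely, building a search engine for the intranet becomes especially important.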

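Foshan Qidian Software (http://www.rivues.com) has released Qidian R3, an out-of-the-box enterprise search engine product that is also open source. It can be downloaded from http://sourceforge.net/projects/rivues/files/, and the latest version is 5.3. I installed it and tried it out: it works very well, and a desktop search (a search engine over local files) can be set up quickly. However, I did not find any interface for crawling website content.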

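Nutch is an open source Apache project; the latest version is 1.3. It is a powerful web crawling and indexing tool. Version 1.3 appears to build indexes only through Solr, with the Lucene index removed (the bin/nutch index command no longer works). Solr is also an open source Apache project, essentially an indexing tool built on Lucene, but it returns search results as XML, JSON, and similar formats, so users have to develop their own HTML presentation module.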

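In fact, Qidian R3 also builds its index on Solr, and it comes with a very complete presentation interface. After analyzing the Qidian R3 source code, I got Nutch to crawl website content while Qidian R3 builds the index and provides the user search interface. The process is written up here for reference.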

I. Installing and configuring Qidian R3

1. Download Qidian R3 version 5.3, qidian_r3_fulltext_search_5.3_without_jdk.zip, from http://sourceforge.net/projects/rivues/files/ and extract it to a directory such as d:\r3.

2. If JDK 1.6 is not installed, download it from http://www.java.net/download/jdk6/6u10/promoted/b32/binaries/jdk-6u10-rc2-bin-b32-windows-i586-p-12_sep_2008.exe, install it, and add the JDK's bin directory to the Windows PATH environment variable.

3. Open the d:\r3\bin folder and run startup.bat to start R3. Enter http://127.0.0.1:880/ in a browser; if the R3 search page opens, R3 has been installed and configured successfully.
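
As a scripted alternative to the browser check, here is a minimal Python sketch. It assumes R3 listens on the address given above; the function name r3_is_up and the timeout value are illustrative choices and not part of R3.

```python
import http.client
import urllib.error
import urllib.request


def r3_is_up(url="http://127.0.0.1:880/", timeout=5.0):
    """Return True if an HTTP GET to `url` ends with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP 4xx/5xx, malformed reply, ...
        return False
```

Calling r3_is_up() after startup.bat has finished should return True if the search page is being served. A False result only says the page could not be fetched, not why.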

II. Adding the index fields used by Nutch to R3

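To index the pages Nutch crawls into Qidian R3's index, you need to know which index fields Nutch uses when indexing, and add the corresponding fields to Qidian R3's index structure.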

1. Download nutch-1.3 from http://www.apache.org/dist//nutch/apache-nutch-1.3-bin.zip, extract it, and open schema.xml in the conf directory. The part that defines the fields reads as follows:

<fields>
        <field name="id" type="string" stored="true" indexed="true"/>

        <!-- core fields -->
        <field name="segment" type="string" stored="true" indexed="false"/>
        <field name="digest" type="string" stored="true" indexed="false"/>
        <field name="boost" type="float" stored="true" indexed="false"/>

        <!-- fields for index-basic plugin -->
        <field name="host" type="url" stored="false" indexed="true"/>
        <field name="site" type="string" stored="false" indexed="true"/>
        <field name="url" type="url" stored="true" indexed="true"
            required="true"/>
        <field name="content" type="text" stored="false" indexed="true"/>
        <field name="title" type="text" stored="true" indexed="true"/>
        <field name="cache" type="string" stored="true" indexed="false"/>
        <field name="tstamp" type="long" stored="true" indexed="false"/>

        <!-- fields for index-anchor plugin -->
        <field name="anchor" type="string" stored="true" indexed="true"
            multiValued="true"/>

        <!-- fields for index-more plugin -->
        <field name="type" type="string" stored="true" indexed="true"
            multiValued="true"/>
        <field name="contentLength" type="long" stored="true"
            indexed="false"/>
        <field name="lastModified" type="long" stored="true"
            indexed="false"/>
        <field name="date" type="string" stored="true" indexed="true"/>

        <!-- fields for languageidentifier plugin -->
        <field name="lang" type="string" stored="true" indexed="true"/>

        <!-- fields for subcollection plugin -->
        <field name="subcollection" type="string" stored="true"
            indexed="true"/>

        <!-- fields for feed plugin -->
        <field name="author" type="string" stored="true" indexed="true"/>
        <field name="tag" type="string" stored="true" indexed="true"/>
        <field name="feed" type="string" stored="true" indexed="true"/>
        <field name="publishedDate" type="string" stored="true"
            indexed="true"/>
        <field name="updatedDate" type="string" stored="true"
            indexed="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <defaultSearchField>content</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>

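Here id, segment, digest, and boost are the core fields; host, site, url, content, title, cache, and tstamp are the index fields used by index-basic; anchor is used by index-anchor; and type, contentLength, lastModified, and date are used by index-more. Corresponding fields therefore have to be created in R3.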
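
To list these fields without reading the file by hand, a minimal Python sketch like the following could be used. It assumes the full schema.xml wraps <fields> in a single root element (shown here as <schema>), and it treats a missing multiValued attribute as false, which is an assumption about Solr's default. The function name read_schema_fields and the three-field SAMPLE excerpt are illustrative only.

```python
import xml.etree.ElementTree as ET


def read_schema_fields(schema_xml):
    """Map each <field> name to its remaining attributes."""
    root = ET.fromstring(schema_xml)
    fields = {}
    for node in root.iter("field"):
        attrs = dict(node.attrib)
        name = attrs.pop("name")
        # Assumption: an absent multiValued attribute means single-valued.
        attrs.setdefault("multiValued", "false")
        fields[name] = attrs
    return fields


# Three fields taken from the listing above, wrapped in an assumed root element
SAMPLE = """
<schema>
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
    <field name="url" type="url" stored="true" indexed="true" required="true"/>
    <field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
  </fields>
</schema>
"""

fields = read_schema_fields(SAMPLE)
multi = sorted(name for name, attrs in fields.items() if attrs["multiValued"] == "true")
print(multi)  # ['anchor']
```

For the real file, the text passed in would be the contents of conf/schema.xml from the extracted Nutch directory.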

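Analysis of the R3 source shows that R3 embeds Solr but, unlike Solr, does not keep its index field definitions in the schema.xml file under conf. Presumably to make field management easier for users, it stores the field definitions in a Derby database; Derby is also an open source Apache project.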

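Open http://127.0.0.1:880/, click login, and log in as the admin user. Click the “索引字段” (Index Fields) icon to enter the index field management page, and add fields 6-19 from the table below.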

No.	Chinese alias	Name	Data type	Stored	Indexed	Multi-valued	Stats field	Sort field	Copy to field
1	标题 (Title)	title	text	true	true	true	true	true	
2	内容 (Content)	text	text	true	true	true	false	false	
3	附件数 (Attachment count)	files	tint	true	true	true	false	true	
4	文档类型 (Document type)	contentType	string	true	true	true	true	false	
5	文件目录 (File directory)	parent	string	false	true	true	true	false	
6	内容 (Content)	content	text	true	true	true	true	true	text
7	标识 (Identifier)	url	text	true	true	true	false	true	parent
8	主机名 (Host name)	host	string	true	true	false	false	true	
9	缓存 (Cache)	cache	string	true	true	true	false	false	
10	邮戳 (Timestamp)	tstamp	tdate	true	true	false	false	false	
11	锚 (Anchor)	anchor	text	true	true	true	false	true	
12	内容长 (Content length)	contentLength	string	true	true	false	false	false	
13	时间 (Time)	lastModified	tdate	true	true	false	false	true	
14	站点 (Site)	site	string	true	true	false	false	false	
15	段 (Segment)	segment	string	true	true	false	false	false	
16	digest	digest	string	true	true	false	false	false	
17	boost	boost	float	true	true	false	false	true	
18	类型 (Type)	type	string	true	true	true	true	false	
19	日期 (Date)	date	tdate	true	true	false	false	false

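Here content and url are set to copy to the text and parent fields respectively. Note that the data type and multi-valued attribute of each added field must be correct: for example, anchor and type must have the multi-valued attribute set to true. If they are defined incorrectly, indexing from Nutch into R3 will fail.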
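
The multi-valued requirement can be checked mechanically before indexing. The sketch below is only an illustration: the R3_MULTI values are copied by hand from rows 6-19 of the table, NUTCH_MULTI lists the fields that Nutch's schema.xml marks multiValued="true", and the function name multivalue_problems is hypothetical. It does not read anything from R3 itself.

```python
# Multi-valued flag of fields 6-19, copied by hand from the table above
R3_MULTI = {
    "content": True, "url": True, "host": False, "cache": True,
    "tstamp": False, "anchor": True, "contentLength": False,
    "lastModified": False, "site": False, "segment": False,
    "digest": False, "boost": False, "type": True, "date": False,
}

# Fields that Nutch's schema.xml declares with multiValued="true"
NUTCH_MULTI = {"anchor", "type"}


def multivalue_problems(r3_multi, nutch_multi):
    """Return Nutch multi-valued fields that are missing or single-valued in R3."""
    return sorted(name for name in nutch_multi if not r3_multi.get(name, False))


print(multivalue_problems(R3_MULTI, NUTCH_MULTI))  # []

# A deliberately wrong definition, to show what a failure looks like
broken = dict(R3_MULTI, anchor=False)
print(multivalue_problems(broken, NUTCH_MULTI))  # ['anchor']
```

The check is one-directional on purpose: a field that is multi-valued in R3 but single-valued in Nutch is not flagged, since the text only warns about the opposite case.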

(To be continued)

Source: oschina

Link: https://my.oschina.net/u/164278/blog/28549
