Building your own crawler search engine on a Mac (nutch+elasticsearch was a failed attempt; switching to scrapy+elasticsearch)

Posted by 感情迁移 on 2020-02-22 15:23:21

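1. Introduction

The project needs a crawler that can also provide personalized information retrieval and push notifications. Of the various crawler frameworks I came across, this one looked the most appealing:

Building a search engine with Nutch+MongoDB+ElasticSearch+Kibana

The original English article is at: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/

I considered using docker to set the system up for testing.

The docker images come from: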

https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html

https://store.docker.com/community/images/pure/nutch-mongo

However, downloading the docker images was far too slow, so I gave up on docker!

 

Setting JAVA_HOME on a Mac:

vi ~/.bash_profile

export JAVA_HOME=$(/usr/libexec/java_home)
export PATH=$JAVA_HOME/bin:$PATH
export CLASS_PATH=$JAVA_HOME/lib

 

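2. Installing Mongo

On a Mac, install it directly with brew; the latest version at the time was 3.4.7.

After installing, create the /data/db directory and start the service with mongod.

To test, connect with the mongo command and enter show dbs to list the databases.

brew install mongodb
sudo mkdir -p /data/db
sudo chown -R <your username> /data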

mongod
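Before moving on, it can help to confirm that mongod is actually accepting connections. The following is a minimal sketch, not part of the original setup: it assumes mongod listens on its default port 27017 on localhost, and the helper name port_is_open is made up for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


if __name__ == "__main__":
    # Assumption: mongod runs locally on its default port 27017.
    print("mongod reachable:", port_is_open("localhost", 27017))
```

The same helper can be pointed at 9200 (elasticsearch) or 5601 (kibana) later in the walkthrough.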

3. Installing es+kibana

Download es; the latest version is 5.5.1. URL: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.1.tar.gz

Edit the configuration

$ vim config/elasticsearch.yml
cluster.name: my-application
node.name: "node-1"
node.master: true
node.data: true
path.data: /opt/elasticsearch/data
network.bind_host: 127.0.0.1
network.publish_host: 127.0.0.1
network.host: 127.0.0.1
 
Run: bin/elasticsearch
Open in a browser: http://localhost:9200
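The browser check can also be scripted. The sketch below assumes the node answers on http://localhost:9200 and that the root endpoint returns JSON containing a version.number field; the function names are illustrative, not part of any elasticsearch client.

```python
import json
import urllib.request


def parse_es_version(body: str) -> str:
    """Extract version.number from the JSON returned by the root endpoint."""
    return json.loads(body)["version"]["number"]


def fetch_es_version(url: str = "http://localhost:9200") -> str:
    """Fetch the root endpoint and return the reported version string."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_es_version(resp.read().decode("utf-8"))


if __name__ == "__main__":
    try:
        print("elasticsearch version:", fetch_es_version())
    except (OSError, ValueError, KeyError) as exc:
        # OSError covers connection failures; the others cover unexpected bodies.
        print("elasticsearch check failed:", exc)
```

Knowing the exact server version matters later, because the Nutch indexer plugin has to be built against a matching client library.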
 

Download kibana; the latest version is 5.5.1. Link: Mac

Run: bin/kibana

Open in a browser: http://localhost:5601

 

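4. Installing Apache nutch

Download Apache Nutch 2.3.1 (src.tar.gz): http://nutch.apache.org/downloads.html

Set the environment variable (run this from the extracted Nutch source directory): export NUTCH_HOME=$(pwd)

Edit the configuration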

$ cat conf/nutch-site.xml
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>
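A typo in nutch-site.xml is easy to miss, so reading the file back programmatically can be worthwhile. This is a minimal sketch using only the Python standard library; it assumes the Hadoop-style layout shown above (property elements with name and value children) and that NUTCH_HOME is set as described earlier. The function names are hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def properties_from_root(root) -> dict:
    """Map each <property>'s <name> text to its <value> text."""
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


def read_properties(xml_text: str) -> dict:
    """Parse a configuration document given as a string."""
    return properties_from_root(ET.fromstring(xml_text))


if __name__ == "__main__":
    # Assumption: NUTCH_HOME points at the Nutch source tree.
    path = os.path.join(os.environ.get("NUTCH_HOME", "."), "conf", "nutch-site.xml")
    if os.path.exists(path):
        props = properties_from_root(ET.parse(path).getroot())
        print("storage.data.store.class =", props.get("storage.data.store.class"))
    else:
        print("not found:", path)
```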
 
Uncomment the mongodb-related entries:
$NUTCH_HOME/ivy/ivy.xml:

<dependency org="org.apache.gora" name="gora-mongodb" rev="0.5" conf="*->default" />

$NUTCH_HOME/conf/gora.properties

############################
# MongoDBStore properties #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
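To double-check that the MongoDB lines really were uncommented, the properties file can be read back with a few lines of Python. This is a simplified sketch: it only handles key=value lines and # comments, not the full Java properties syntax, and parse_properties is a name invented here.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


if __name__ == "__main__":
    sample = """
    # MongoDBStore properties #
    gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
    gora.mongodb.servers=localhost:27017
    gora.mongodb.db=nutch
    """
    props = parse_properties(sample)
    # Split the server entry into host and port for a connectivity check.
    host, _, port = props["gora.mongodb.servers"].partition(":")
    print(host, int(port), props["gora.mongodb.db"])
```

A key that is still commented out simply does not appear in the resulting dictionary, which makes a missed uncomment easy to spot.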
 
Important! The elastic plugin needs to be updated! The original plugin targets version 1.4.1, while the latest is now 5.5.1.
Edit the plugin's ivy.xml:

cd src/plugin/indexer-elastic/

vi ivy.xml

...

  <dependencies>
    <dependency org="org.elasticsearch" name="elasticsearch"
      rev="5.5.1" conf="*->default" />
  </dependencies>

...

ant -f ./build-ivy.xml 

Run ls lib to check the downloaded versions, then update the version numbers in plugin.xml.

<library name="HdrHistogram-2.1.9.jar"/>
<library name="elasticsearch-5.5.1.jar"/>
<library name="hppc-0.7.1.jar"/>
<library name="jackson-core-2.8.6.jar"/>
<library name="jackson-dataformat-cbor-2.8.6.jar"/>
<library name="jackson-dataformat-smile-2.8.6.jar"/>
<library name="jackson-dataformat-yaml-2.8.6.jar"/>
<library name="jna-4.4.0.jar"/>
<library name="joda-time-2.9.5.jar"/>
<library name="jopt-simple-5.0.2.jar"/>
<library name="log4j-api-2.8.2.jar"/>
<library name="lucene-analyzers-common-6.6.0.jar"/>
<library name="lucene-backward-codecs-6.6.0.jar"/>
<library name="lucene-core-6.6.0.jar"/>
<library name="lucene-grouping-6.6.0.jar"/>
<library name="lucene-highlighter-6.6.0.jar"/>
<library name="lucene-join-6.6.0.jar"/>
<library name="lucene-memory-6.6.0.jar"/>
<library name="lucene-misc-6.6.0.jar"/>
<library name="lucene-queries-6.6.0.jar"/>
<library name="lucene-queryparser-6.6.0.jar"/>
<library name="lucene-sandbox-6.6.0.jar"/>
<library name="lucene-spatial-6.6.0.jar"/>
<library name="lucene-spatial-extras-6.6.0.jar"/>
<library name="lucene-spatial3d-6.6.0.jar"/>
<library name="lucene-suggest-6.6.0.jar"/>
<library name="securesm-1.1.jar"/>
<library name="snakeyaml-1.15.jar"/>
<library name="t-digest-3.0.jar"/>
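Typing out that many library entries by hand is error-prone. As a rough sketch, the entries could be generated from the jar file names in lib instead. It assumes the script is run from src/plugin/indexer-elastic after ant -f ./build-ivy.xml has populated lib; library_entries is a hypothetical helper, and the output still has to be pasted into plugin.xml manually.

```python
import os


def library_entries(file_names):
    """Return one <library name="..."/> line per .jar file name, sorted."""
    return [
        f'<library name="{name}"/>'
        for name in sorted(file_names)
        if name.endswith(".jar")
    ]


if __name__ == "__main__":
    # Assumption: the current directory is the plugin directory containing lib/.
    lib_dir = "lib"
    if os.path.isdir(lib_dir):
        print("\n".join(library_entries(os.listdir(lib_dir))))
    else:
        print("no lib directory here; run ant -f ./build-ivy.xml first")
```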

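However! The even bigger pitfall is that the plugin's own code then fails with errors! I'm done fiddling with it; giving up!

Start the build: ant runtime    (it ran for 33 minutes!)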
 

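Conclusion

1. nutch 2.x and elasticsearch 5.x do not work well together for now; I don't want to keep fiddling with it, so I'm giving up.

2. Next time I'll try a new architecture: scrapy + scrapy-redis + mongodb + elasticsearch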
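For the planned architecture, crawled items eventually have to be sent to elasticsearch. One possible way, sketched here under assumptions and not taken from any finished setup, is to serialize items into the newline-delimited body that the elasticsearch bulk endpoint expects. The index name "pages", the type name "page", and the item fields are all made up for illustration; the _type field matches elasticsearch 5.x and may not apply to other versions.

```python
import json


def to_bulk_body(items, index: str, doc_type: str) -> str:
    """Build a bulk request body: one action line plus one source line per item."""
    lines = []
    for item in items:
        # Action line telling elasticsearch where to index the next document.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line holding the document itself.
        lines.append(json.dumps(item, ensure_ascii=False))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Hypothetical crawled item, as a scrapy pipeline might receive it.
    items = [{"url": "http://example.com/", "title": "Example"}]
    print(to_bulk_body(items, "pages", "page"), end="")
```

The resulting string would be POSTed to the _bulk endpoint of the node started in section 3; that HTTP call is left out here.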

 