Facing issue in Elasticsearch mapping of Nutch crawled document


Question


I am facing some serious issues while using Nutch and Elasticsearch for crawling.

We have two data storage engines in our app:

  1. MySQL

  2. Elasticsearch

Let's say I have 10 URLs stored in the urls table of the MySQL DB. I want to fetch these URLs from the table at runtime and write them all into seed.txt in one go for crawling. The crawl then starts, and I index the crawled documents into an Elasticsearch index (let's say the url index). But I want to maintain a reference inside the Elasticsearch index so that I can fetch a particular URL's crawled details for analytics purposes, since the Elasticsearch index only contains crawled data. For example:

My table structure in MySQL is:

    Table urls:

    id   url
    1    www.google.com

The Elasticsearch mapping I want is:

    Index url:

    {
      "_id": "www.google.com",
      "type": "doc",
      "content": "Hello world",
      "url_id": 1,
      ...
    }

Here url_id is the value of the id column for the crawled URL in the urls table.

I could create a separate index for each URL, but that solution is not ideal because at the end of the day I would have a large number of indices. So how can I achieve this after crawling? Do I have to modify the Elasticsearch indexer? I am using Nutch 1.12 and Elasticsearch 1.7.1. Any help would be greatly appreciated.


Answer 1:


You should pass the url_id as additional metadata in your seed list and use the urlmeta and index-metadata plugins, so that the key/value pair gets passed to the outlinks (if necessary) or is at least available at indexing time.

See the Nutch wiki for an explanation of how to index metatags.



Source: https://stackoverflow.com/questions/39697398/facing-issue-in-elasticsearch-mapping-of-nutch-crawled-document
