Facing issue in Elasticsearch mapping of Nutch crawled document


Question


I am facing some serious issues while using Nutch and Elasticsearch for crawling.

We have two data storage engines in our app:

  1. MySQL

  2. Elasticsearch

Let's say I have 10 URLs stored in the urls table of the MySQL DB. I want to fetch these URLs from the table at runtime and write them all into seed.txt in one go for crawling. The crawl then starts, and I index the crawled documents into an Elasticsearch index (let's say the url index). But I want to maintain a reference inside the Elasticsearch index so that I can fetch a particular URL's crawled details for analytics purposes, since the Elasticsearch index only contains crawled data. For example:

My table structure in MySQL is:

    Table urls:

    id   url
    1    www.google.com

The Elasticsearch mapping I want is:

    Index url:

    {
      "_id": "www.google.com",
      "type": "doc",
      "content": "Hello world",
      "url_id": 1,
      ...
    }

Here url_id is the value of the id column for the crawled URL in the urls table.

I could create a separate index for each URL, but that solution is not ideal because at the end of the day I would have a large number of indices. So how can I achieve this after crawling? Do I have to modify the Elasticsearch indexer? I am using Nutch 1.12 and Elasticsearch 1.7.1. Any help would be greatly appreciated.


Answer 1:


You should pass the url_id as additional metadata in your seed list and use the urlmeta and index-metadata plugins, so that the key/value pair gets passed to the outlinks (if necessary) or is at least available at indexing time.

See the Nutch wiki for an explanation of how to index metatags.



Source: https://stackoverflow.com/questions/39697398/facing-issue-in-elasticsearch-mapping-of-nutch-crawled-document
