Removing menus from HTML during crawl or indexing with Nutch and Solr

Submitted by 别等时光非礼了梦想 on 2019-12-03 20:06:30

Here is a patch for Solr that you can place in your indexing config to ignore the contents of the tags you configure. It only works with XML, though: if you can tidy your HTML, or you know it is XHTML, this will work, but it won't work with arbitrary HTML.

I think you have a few choices:

  1. Extend the Nutch HTML parser and add logic to strip the header out. (There might be better places to do this, such as when you have the raw data but before the DOM is parsed.)
  2. Make your site smart enough not to draw the header when Nutch is crawling. This is pretty easy to do by checking the User-Agent value in the request header. You might need to do a better job of seeding your crawl, since the links in the header won't be there to help Nutch find the other pages.
  3. Somehow get Solr to remove the header from the Nutch data. I'm not sure how you'd do this, and I think it means you lose some of the Nutch/Solr synergies.
  4. Somehow edit the Nutch index (it is just a Lucene index). In theory, you could walk through all documents in the index and trim the relevant field of each Document.

I would think the easiest way is #2, provided you have a consistent way of drawing the header (i.e. a skin or a common include). After that, perhaps #1 and #4. I think #3 would be the hardest, but I might be wrong.
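As a rough illustration of option 2, here is a minimal Python sketch of a page renderer that leaves out the shared menu when the request comes from the crawler. Everything in it is an assumption made for the example: `CRAWLER_MARKER`, `is_crawler`, `render_page` and the menu markup are hypothetical names, and the marker should be whatever agent name your crawl is configured to send (in Nutch this is set with `http.agent.name`).

```python
# Hypothetical marker: use the agent name your crawler actually sends
# (for Nutch, the value configured as http.agent.name).
CRAWLER_MARKER = "mycrawler"

# Hypothetical shared menu that every page normally includes.
MENU_HTML = '<ul id="menu"><li><a href="/about">About</a></li></ul>'


def is_crawler(user_agent):
    """Return True when the User-Agent header contains the crawler marker."""
    return CRAWLER_MARKER in (user_agent or "").lower()


def render_page(content, user_agent):
    """Assemble a page, omitting the shared menu for the crawler."""
    if is_crawler(user_agent):
        return "<html><body>" + content + "</body></html>"
    return "<html><body>" + MENU_HTML + content + "</body></html>"
```

Because the crawler no longer sees the menu links, the seed list has to cover the pages those links would otherwise have led to.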

Nutch 1.12 introduced a feature in the Apache Tika parser plugin (parse-tika) that uses the Boilerpipe algorithm to strip header and footer content from HTML pages during the parsing stage itself.

To enable it, set the following properties in nutch-site.xml:

<!-- parse-tika plugin properties -->
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>DefaultExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
  or CanolaExtractor.
  </description>
</property>

It's working for me. Hope it works for others as well. :)

For a detailed overview, refer to this ticket: https://issues.apache.org/jira/browse/NUTCH-961

If you want to do that, I believe you should write a customized parser in Nutch so that the data sent to the index does not contain the menu content. After parsing, the text is essentially raw text without any structure, so the menu has to be dropped while the markup is still available.
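The logic such a parser needs can be sketched outside Nutch. The following Python example is only an illustration of the idea, not Nutch plugin code (a real Nutch parser or parse filter would be written in Java against Nutch's plugin API). It assumes the menu can be recognised by tag name or by `id`; the class name, the default tag list and the `menu` id are all hypothetical, and the depth counter assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MenuStrippingExtractor(HTMLParser):
    """Collect page text while skipping everything inside configured elements."""

    def __init__(self, skip_tags=("nav", "header", "footer", "script", "style"),
                 skip_ids=("menu",)):
        super().__init__(convert_charrefs=True)
        self.skip_tags = set(skip_tags)   # hypothetical defaults
        self.skip_ids = set(skip_ids)     # hypothetical defaults
        self.skip_depth = 0               # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside a skipped element: track nesting only.
            self.skip_depth += 1
        elif tag in self.skip_tags or dict(attrs).get("id") in self.skip_ids:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the page text with menu-like elements removed."""
    parser = MenuStrippingExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The point of the sketch is the ordering: the decision about what to drop is made while tags and attributes are still visible, and only the remaining text is flattened for indexing.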
