Apache Nutch 2.1 - How to get complete source code

Submitted by 给你一囗甜甜゛ on 2019-12-11 04:23:24

Question


I am trying to write my own Nutch plugin for crawling webpages. The problem is that I need to detect whether a particular tag, e.g. <foo>, is present on the webpage. The official documentation notes that this is possible using Document.getElementsByTagName("foo"), but this is not working for me. Do you have any idea why?

My second question: once I have identified the tag above, I would like to get some other tags from the same webpage. Is there any way to store the complete source code of the webpage at the moment it is crawled?

Thanks, Jan.


Answer 1:


If you want to extract content based on an HTML tag, you could look at the xpath-filter plugin: http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ You write an XPath query and configure it in the plugin to extract the information you need.
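
To make the XPath idea concrete, here is a minimal standalone sketch that evaluates a query with the JDK's javax.xml.xpath API. It does not use the plugin or its configuration format; the class name XPathExtractSketch, the sample page and the query are made up for illustration, and the input is assumed to be well-formed markup (real-world HTML would first need a tolerant HTML parser).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or of the xpath-filter plugin.
public class XPathExtractSketch {

    // Evaluates an XPath expression against well-formed markup and returns
    // the text of the first matching node ("" when nothing matches).
    public static String extract(String markup, String expression) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (String) xpath.evaluate(expression, dom, XPathConstants.STRING);
        } catch (Exception e) {
            // Covers malformed markup as well as an invalid XPath expression.
            throw new IllegalArgumentException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Sample page and query are assumptions for this sketch.
        String page = "<html><body><div id=\"content\">Hello <b>Nutch</b></div></body></html>";
        System.out.println(extract(page, "//div[@id='content']"));
    }
}
```

The query returns the concatenated text of the matched element and its descendants, which is usually what you want when the goal is to index the content of one specific block.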

Another option is to write a plugin (as you are doing at the moment) and use an HTML/XML parser to get the information out. Here's what I have done when I needed to extract some content out of a specific div:

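  // Indexing filter method; the signature is the one used by the Nutch 1.x IndexingFilter interface.
  // Needs org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.nodes.Element on the classpath.
  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException {

        // Raw HTML of the page, assumed to have been stored under the "fullcontent" key
        // of the parse metadata by an earlier step (for example a custom parse filter).
        Metadata metadata = parse.getData().getParseMeta();
        String fullContent = metadata.get("fullcontent");
        if (fullContent == null) {
            return doc; // no stored HTML, nothing to extract
        }

        // Parse the HTML with jsoup and select the first <div id="content">.
        Document document = Jsoup.parse(fullContent);
        Element contentwrapper = document.select("div#content").first();

        // Add the div's text as a new index field, but only if the div exists;
        // first() returns null when the selector matches nothing.
        if (contentwrapper != null) {
            doc.add("contentwrapper", contentwrapper.text());
        }

        return doc;
  }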
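
On the Document.getElementsByTagName("foo") part of the question: one thing that may be worth checking, although nothing in this thread confirms it is the cause, is the case of the element names. In a W3C DOM the lookup is case-sensitive, and some HTML parsers report element names in upper case. The sketch below uses only the JDK's DOM classes; the class name TagLookupSketch and the sample markup are made up for illustration, and the input is assumed to be well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
public class TagLookupSketch {

    // Parses well-formed markup into a W3C DOM document.
    public static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    // Counts elements whose name equals tagName, ignoring case.
    public static int countIgnoreCase(Document dom, String tagName) {
        NodeList all = dom.getElementsByTagName("*"); // "*" matches every element
        int count = 0;
        for (int i = 0; i < all.getLength(); i++) {
            if (all.item(i).getNodeName().equalsIgnoreCase(tagName)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Upper-case element names stand in for what some HTML parsers produce.
        Document dom = parse("<HTML><BODY><FOO>x</FOO></BODY></HTML>");
        System.out.println(dom.getElementsByTagName("foo").getLength()); // exact-case lookup
        System.out.println(countIgnoreCase(dom, "foo"));                  // case-insensitive lookup
    }
}
```

If the exact-case lookup finds nothing while the case-insensitive count is positive, querying with the upper-case name (or comparing names with equalsIgnoreCase) is a simple workaround.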


Source: https://stackoverflow.com/questions/15717239/apache-nutch-2-1-how-get-complete-source-code
