How to Crawl .pdf links using Apache Nutch

If you want Nutch to crawl and index your pdf documents, you have to enable document crawling and the Tika plugin:

  1. Document crawling

    1.1 Edit regex-urlfilter.txt and remove every occurrence of "pdf", so the suffix skip rule looks like this:

    # skip image and other suffixes we can't yet parse
    # for a more extensive coverage use the urlfilter-suffix plugin
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
    

    1.2 Edit suffix-urlfilter.txt and remove every occurrence of "pdf".
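    This file, used by the urlfilter-suffix plugin, is a plain list of suffixes to reject, one per line; the exact default contents vary by Nutch version. An illustrative excerpt (not the verbatim file) after deleting the pdf entry:

    # suffixes to reject (illustrative; your defaults may differ)
    .exe
    .zip
    .gz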

    1.3 Edit nutch-site.xml and add "parse-tika" and "parse-html" to the plugin.includes property:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Regular expression naming plugin directory names to
      include.  Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable 
      protocol-httpclient, but be aware of possible intermittent problems with the 
      underlying commons-httpclient library.
      </description>
    </property>
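    1.4 (Optional) Verify the setup: Nutch ships a ParserChecker tool you can point at a PDF URL to confirm that Tika is now picked up for parsing; the URL below is just a placeholder:

    bin/nutch parsechecker -dumpText http://example.com/sample.pdf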
    
  2. If what you really want is simply to download all PDF files linked from a page, you can use a tool such as Teleport on Windows or wget on *nix.
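    For example, wget can fetch just the PDFs linked from a page (example.com is a placeholder):

    # -r: recurse, -l 1: one level deep, -nd: no directory tree,
    # -A pdf: keep only .pdf files, --no-parent: don't climb above the start URL
    wget -r -l 1 -nd -A pdf --no-parent http://example.com/docs/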

Alternatively, you can write your own parse plugin for the application/pdf MIME type, or use the embedded Apache Tika parser, which can already extract text from PDFs.
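As a minimal sketch of the Tika option (standalone, outside Nutch; the tika-core/tika-parsers dependencies on the classpath and the file path are assumptions for illustration):

    import java.io.File;
    import org.apache.tika.Tika;

    public class PdfTextExtractor {
        public static void main(String[] args) throws Exception {
            // The Tika facade auto-detects the MIME type (application/pdf here)
            // and delegates to the bundled PDF parser.
            Tika tika = new Tika();
            String text = tika.parseToString(new File("sample.pdf")); // placeholder path
            System.out.println(text);
        }
    }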
