Nutch not crawling URLs except the one specified in seed.txt

悲&欢浪女 2021-01-15 19:53

I am using Apache Nutch 1.12, and the URL I am trying to crawl looks like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don\

2 answers
  •  南方客 (original poster)
    2021-01-15 20:09

    I got this working after trying multiple things over the last 2 days. Here is the solution:

    Since the website I was crawling was very heavy, the http.content.limit property in nutch-default.xml was truncating each page to 65536 bytes (the default). Unfortunately, the links I wanted to crawl fell outside the retained portion, so Nutch never crawled them. After I set the limit to unlimited by adding the following property to nutch-site.xml, it started crawling my pages:

    
    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content using the http://
      protocol, in bytes. If this value is nonnegative (>=0), content longer
      than it will be truncated; otherwise, no truncation at all. Do not
      confuse this setting with the file.content.limit setting.
      </description>
    </property>
    
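To illustrate why the truncation hid the links, here is a minimal Python sketch. It is not Nutch code: the page contents, the `/early` and `/late` links, and the helper names `truncate` and `extract_links` are all hypothetical, and the regex is only a rough stand-in for a real HTML parser. It just applies the rule from the property description (a non-negative limit truncates, a negative one does not) to a page whose second link sits past the 65536-byte mark.

```python
import re

DEFAULT_LIMIT = 65536  # default value quoted in the answer above


def truncate(content, limit):
    """Apply the documented rule: limit >= 0 truncates, limit < 0 keeps everything."""
    return content if limit < 0 else content[:limit]


def extract_links(html):
    """Rough href extraction for illustration only (not how Nutch parses pages)."""
    return [m.decode() for m in re.findall(rb'href="([^"]+)"', html)]


# Hypothetical heavy page: an early link, 70000 bytes of filler, then a late link.
page = (
    b'<html><body><a href="/early">early</a>'
    + b"x" * 70000
    + b'<a href="/late">late</a></body></html>'
)

links_default = extract_links(truncate(page, DEFAULT_LIMIT))
links_unlimited = extract_links(truncate(page, -1))

print(links_default)    # only the link inside the first 65536 bytes
print(links_unlimited)  # both links
```

With the default limit, the late link is cut off before any link extraction happens, which matches the symptom described: only links near the top of a large page get discovered.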
