Nutch not crawling URLs except the one specified in seed.txt


悲&欢浪女 2021-01-15 19:53

I am using Apache Nutch 1.12, and the URL I am trying to crawl looks like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don\

2 Answers

南方客 (original poster)

2021-01-15 20:09
Got that working after trying multiple things over the last 2 days. Here is the solution:

Since the website I was crawling was very heavy, the http.content.limit property in nutch-default.xml was truncating each page to 65536 bytes (the default). Unfortunately, the links I wanted to crawl fell outside the retained portion, so Nutch never crawled them. When I changed the limit to unlimited by putting the following values in nutch-site.xml, it started crawling my pages:
```
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
```
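If you want to check whether truncation is what hides your links before changing the config, something like the rough Python sketch below may help. It is not how Nutch parses pages. It only cuts the raw bytes at a limit and compares the `<a href>` values found before and after the cut. The helper names (`LinkCollector`, `extract_links`, `links_lost_to_truncation`) and the synthetic demo page are made up for illustration, and UTF-8 decoding is an assumption.

```python
from html.parser import HTMLParser

# Default value of http.content.limit mentioned in the answer above.
DEFAULT_LIMIT = 65536


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_bytes):
    """Return all <a href> values found in the given bytes (UTF-8 assumed)."""
    parser = LinkCollector()
    parser.feed(html_bytes.decode("utf-8", errors="replace"))
    parser.close()
    return parser.links


def links_lost_to_truncation(html_bytes, limit=DEFAULT_LIMIT):
    """Return links in the full page that are missing after cutting at `limit`.

    A negative limit means no truncation, mirroring the property description.
    """
    if limit < 0:
        return []
    kept = set(extract_links(html_bytes[:limit]))
    return [link for link in extract_links(html_bytes) if link not in kept]


if __name__ == "__main__":
    # Synthetic page: one link near the top, one beyond the 65536-byte mark.
    page = (
        b"<html><body><a href='/early'>early</a>"
        + b"x" * 70000
        + b"<a href='/late'>late</a></body></html>"
    )
    print(links_lost_to_truncation(page))
```

To try it on a real page, save the raw response body to a file and pass its bytes to `links_lost_to_truncation`. A non-empty result suggests the default limit is cutting off links you care about.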