Why doesn't Nutch seem to know about “Last-Modified”?

Submitted by 不羁的心 on 2019-12-24 01:25:52

Question


I set up Nutch with a db.fetch.interval.default of 60000 (seconds, a little under 17 hours) so that I can crawl every day. If I don't, it won't even look at my site when I crawl the next day. But when I do crawl the next day, every page that it fetched yesterday is fetched again with a 200 response code, which suggests it isn't sending the previous day's date in an "If-Modified-Since" header. Shouldn't it skip fetching pages that haven't changed? Is there a way to make it do that? I noticed a ProtocolStatus.NOT_MODIFIED in Fetcher.java, so I would expect it to be able to do this.

By the way, this is copied and pasted from conf/nutch-default.xml in the current trunk:

<!-- web db properties -->

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a page.
  </description>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  </description>
</property>
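
As a quick check of the units in the excerpt above: the deprecated property is expressed in days and the current one in seconds. The following is a minimal sketch using only java.time; the class name FetchIntervalUnits is hypothetical and is not part of Nutch.

```java
import java.time.Duration;

// Hypothetical helper, not part of Nutch: converts the interval values
// quoted above so the two properties can be compared in the same unit.
final class FetchIntervalUnits {

    // db.fetch.interval.default is given in seconds.
    static Duration fromSeconds(long seconds) {
        return Duration.ofSeconds(seconds);
    }

    // db.default.fetch.interval (deprecated) is given in days.
    static Duration fromDays(long days) {
        return Duration.ofDays(days);
    }

    public static void main(String[] args) {
        // Default from nutch-default.xml: 2592000 seconds.
        System.out.println("default, whole days: " + fromSeconds(2_592_000L).toDays());
        // Value used in the question: 60000 seconds.
        System.out.println("question, whole hours: " + fromSeconds(60_000L).toHours());
        // The deprecated default of 30 days, expressed in seconds.
        System.out.println("30 days in seconds: " + fromDays(30L).getSeconds());
    }
}
```

The 60000 seconds used in the question come out at less than one day, which is consistent with pages becoming due again for a daily crawl.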

Answer 1:


I found the problem. It's a bug in Nutch. I've emailed the Nutch developer list about it, but here's my fix:

Index: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
===================================================================
--- src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java  (revision 802632)
+++ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java  (working copy)
@@ -124,11 +124,15 @@
         reqStr.append("\r\n");
       }

-      reqStr.append("\r\n");
       if (datum.getModifiedTime() > 0) {
         reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
         reqStr.append("\r\n");
       }
+      else if (datum.getFetchTime() > 0) {
+          reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getFetchTime()));
+          reqStr.append("\r\n");
+      }
+      reqStr.append("\r\n");     

       byte[] reqBytes= reqStr.toString().getBytes();

Now I'm seeing 304s in my Apache logs where I'm supposed to be seeing them.
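
Reading the diff, the patch does two things: it moves the blank line that ends the header section so that it comes after the If-Modified-Since header rather than before it, and it falls back to the fetch time when no modified time is recorded. The following is a minimal standalone sketch of that ordering, not the Nutch code itself. The class and method names are hypothetical, java.time formatting stands in for Nutch's HttpDateFormat, and the request line and Host header are assumptions for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical sketch, not the Nutch implementation.
final class ConditionalGetSketch {

    // Fixed-length HTTP date, always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    // Returns the header line (with CRLF), or "" if neither time is known.
    // Times are milliseconds since the epoch; values <= 0 mean "unknown".
    static String ifModifiedSinceHeader(long modifiedTimeMs, long fetchTimeMs) {
        // Prefer the recorded modified time; otherwise fall back to the fetch time.
        long since = modifiedTimeMs > 0 ? modifiedTimeMs : fetchTimeMs;
        if (since <= 0) {
            return "";
        }
        return "If-Modified-Since: " + HTTP_DATE.format(Instant.ofEpochMilli(since)) + "\r\n";
    }

    static String buildRequest(String path, String host, long modifiedTimeMs, long fetchTimeMs) {
        StringBuilder req = new StringBuilder();
        req.append("GET ").append(path).append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append("\r\n");
        req.append(ifModifiedSinceHeader(modifiedTimeMs, fetchTimeMs));
        // The blank line ends the header section, so it must be written last;
        // anything appended after it is not read as a header.
        req.append("\r\n");
        return req.toString();
    }

    public static void main(String[] args) {
        // No modified time recorded, so the fetch time is used instead.
        System.out.print(buildRequest("/index.html", "example.com", 0L, 784111777000L));
    }
}
```

A server that honours the header can then answer 304 Not Modified instead of sending the full page with a 200.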




Answer 2:


I think you may have the option name wrong. Instead of db.fetch.interval.default, it should be:

db.default.fetch.interval

This is the number of days after an injected page is fetched before it should be fetched again. The default is 30.

I just read the changelog of the latest version and found the following:

  1. NUTCH-61 - Support for adaptive re-fetch interval and detection of unmodified content. (ab)

If you don't have the latest version installed, I suggest you upgrade.

Also, are you using the -adddays option when crawling?



Source: https://stackoverflow.com/questions/1252289/why-doesnt-nutch-seem-to-know-about-last-modified
