Question
I asked Nutch to crawl a local file, http://localhost:8080/a.txt. My HTTP server is running, and I can see Nutch requesting the file (and, before it, /robots.txt). I am using Cassandra as the storage backend.
However, I cannot find any data from the crawl. When I run ./bin/nutch readdb -dump data ..., I get the output shown below: the status is reported as fetched, but protocolStatus and parseStatus are both null.
Can anyone tell me where the fetched page content is stored?
$ cat data/part-r-00000
http://localhost:8000/a.html
key: localhost:http:8000/a.html
baseUrl: null
status: 2 (status_fetched)
fetchTime: 1426811920382
prevFetchTime: 1424219908314
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
Source: https://stackoverflow.com/questions/28574566/apache-nutch-fetching-but-not-saving-file-content