Apache Nutch fetching but not saving file content

Posted by 自作多情 on 2020-01-24 22:42:39

Question


I asked Nutch to crawl a locally served file: http://localhost:8080/a.txt. The HTTP server is running, and I can see Nutch requesting the file (and, before it, /robots.txt). I am using Cassandra as the storage backend.

However, I cannot see any page content from the crawl. Where is the webpage data stored?

When I run ./bin/nutch readdb -dump data ..., I get the following output:

$ cat data/part-r-00000
http://localhost:8000/a.html
key: localhost:http:8000/a.html
baseUrl: null
status: 2 (status_fetched)
fetchTime: 1426811920382
prevFetchTime: 1424219908314
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
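To make the record above easier to inspect, here is a minimal Python sketch that splits one dump record into a URL and a dictionary of fields, then lists the fields that came back empty. It assumes the dump file holds one `name: value` pair per line after the URL line. The function name `parse_dump_record` and the `SAMPLE` constant are hypothetical helpers for illustration, not part of Nutch.

```python
# Sketch only: assumes one "name: value" pair per line after the URL line.
SAMPLE = """\
http://localhost:8000/a.html
key: localhost:http:8000/a.html
baseUrl: null
status: 2 (status_fetched)
fetchTime: 1426811920382
prevFetchTime: 1424219908314
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
"""


def parse_dump_record(text):
    """Return (url, fields) for a single record of the dump."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    url, fields = lines[0], {}
    for ln in lines[1:]:
        # Split on the first colon only, so values such as
        # "localhost:http:8000/a.html" stay intact.
        name, _, value = ln.partition(":")
        fields[name.strip()] = value.strip()
    return url, fields


url, fields = parse_dump_record(SAMPLE)
# Fields the dump reports as empty.
missing = [name for name, value in fields.items() if value in ("null", "(null)")]
print(url, fields["status"], missing)
```

For this record the status is reported as fetched, while `protocolStatus` and `parseStatus` are both empty and no content field appears at all, which is the symptom the question describes.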

Source: https://stackoverflow.com/questions/28574566/apache-nutch-fetching-but-not-saving-file-content
