Nutch - does not crawl, says “Stopping at depth=1 - no more URLs to fetch”

Submitted by 夙愿已清 on 2020-01-25 23:49:30

Question


I've been trying to crawl with Nutch for a long time, but it just doesn't seem to work. I'm building a Solr search for a website, using Nutch to crawl and index into Solr.

There were some permission problems originally, but they have been fixed. The URL I'm trying to crawl is http://172.30.162.202:10200/, which is not publicly accessible; it is an internal URL that can be reached from the Solr server. I verified this by browsing it with Lynx.

Given below is the output from the Nutch command:

[abgu01@app01 local]$ ./bin/nutch crawl /home/abgu01/urls/url1.txt -dir /home/abgu01/crawl -depth 5 -topN 100
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /opt/apache-nutch-1.4-bin/runtime/local/logs/hadoop.log (No such file or directory)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:136)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:290)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164)
        at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:125)
        at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)
        at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:242)
        at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:254)
        at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:43)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
solrUrl is not set, indexing will be skipped...
crawl started in: /home/abgu01/crawl
rootUrlDir = /home/abgu01/urls/url1.txt
threads = 10
depth = 5
solrUrl=null
topN = 100
Injector: starting at 2012-07-27 15:47:00
Injector: crawlDb: /home/abgu01/crawl/crawldb
Injector: urlDir: /home/abgu01/urls/url1.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-07-27 15:47:03, elapsed: 00:00:02
Generator: starting at 2012-07-27 15:47:03
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /home/abgu01/crawl/segments/20120727154705
Generator: finished at 2012-07-27 15:47:06, elapsed: 00:00:03
Fetcher: starting at 2012-07-27 15:47:06
Fetcher: segment: /home/abgu01/crawl/segments/20120727154705
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://172.30.162.202:10200/
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-07-27 15:47:08, elapsed: 00:00:02
ParseSegment: starting at 2012-07-27 15:47:08
ParseSegment: segment: /home/abgu01/crawl/segments/20120727154705
ParseSegment: finished at 2012-07-27 15:47:09, elapsed: 00:00:01
CrawlDb update: starting at 2012-07-27 15:47:09
CrawlDb update: db: /home/abgu01/crawl/crawldb
CrawlDb update: segments: [/home/abgu01/crawl/segments/20120727154705]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-07-27 15:47:10, elapsed: 00:00:01
Generator: starting at 2012-07-27 15:47:10
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2012-07-27 15:47:11
LinkDb: linkdb: /home/abgu01/crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/abgu01/crawl/segments/20120727154705
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Can anyone suggest why the crawl isn't running? It always ends with "Stopping at depth=1 - no more URLs to fetch", regardless of the depth or topN parameters. Judging from the output above, I think the reason is that the Fetcher isn't able to fetch any content from the URL.

Any inputs are appreciated!


Answer 1:


The site could be blocking crawling via robots.txt or a meta(name="robots" content="noindex") tag. Please check.
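A quick way to test this is to inspect the site's robots.txt. The sketch below uses a local sample file so it is self-contained; against the real server you would fetch http://172.30.162.202:10200/robots.txt instead (for example with Lynx or curl):

```shell
# Illustrative robots.txt (an assumption, not the actual content of the
# server in question). A blanket "Disallow: /" like this would make any
# robots.txt-honoring crawler -- which Nutch is by default -- skip the
# entire site, producing exactly zero fetched pages.
cat > /tmp/robots.txt <<'EOF'
User-agent: *
Disallow: /
EOF

# Check for a blanket disallow rule; a match here means the site is
# off-limits to compliant crawlers.
grep -i '^Disallow: /' /tmp/robots.txt
```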

PS. Your log also shows two issues worth looking at:

1. java.io.FileNotFoundException: /opt/apache-nutch-1.4-bin/runtime/local/logs/hadoop.log (No such file or directory)
2. solrUrl is not set, indexing will be skipped...
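Neither of those log lines stops the crawl by itself, but both are worth fixing. A minimal sketch (the logs path mirrors the exception in the question but is made overridable here so the snippet is self-contained; the Solr address is a placeholder assumption):

```shell
# 1. Create the logs directory that the log4j config expects, so
#    hadoop.log can actually be written (directory layout taken from the
#    FileNotFoundException above):
NUTCH_LOCAL="${NUTCH_LOCAL:-/tmp/apache-nutch-1.4-bin/runtime/local}"
mkdir -p "$NUTCH_LOCAL/logs"

# 2. Pass a Solr URL so indexing is not skipped once fetching works.
#    Nutch 1.4's crawl command accepts -solr; the address below is a
#    placeholder for your actual Solr instance:
#
#      ./bin/nutch crawl /home/abgu01/urls/url1.txt -dir /home/abgu01/crawl \
#          -solr http://localhost:8983/solr -depth 5 -topN 100
```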



Source: https://stackoverflow.com/questions/11710492/nutch-does-not-crawl-says-stopping-at-depth-1-no-more-urls-to-fetch
