Web Crawler Using Twisted


Question


I am trying to create a web crawler with Python and Twisted. The problem is that at the time I call reactor.run() I don't know all the links to fetch, so the code goes like:

from twisted.internet import reactor
from twisted.web.client import getPage

def crawl(url):
    d = getPage(url)            # returns a Deferred that fires with the page body
    d.addCallback(handlePage)
    reactor.run()               # starts the event loop inside crawl()

and handlePage has something like:

def handlePage(output):
    urls = getAllUrls(output)   # getAllUrls: my own link-extraction helper

Now I need to apply crawl() to each URL in urls. How do I do that? Should I stop the reactor and start it again? If I am missing something obvious, please tell me.


Answer 1:


You don't want to stop the reactor. You just want to download more pages. So you need to refactor your crawl function to not stop or start the reactor.

from twisted.internet import reactor
from twisted.web.client import getPage

def crawl(url):
    d = getPage(url)
    d.addCallback(handlePage)

def handlePage(output):
    urls = getAllUrls(output)
    for url in urls:
        crawl(url)              # schedules another download; no reactor restart needed

crawl(startUrl)                 # startUrl: whatever seed page you begin with
reactor.run()                   # start the reactor exactly once, at the end
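One thing this version does not handle: pages that link back to each other will be fetched over and over. A minimal sketch of one way to keep the crawl finite with a visited set (the seen set and the errback are my additions, not part of the original answer; getAllUrls is still the asker's helper):

from twisted.internet import reactor
from twisted.web.client import getPage

seen = set()                    # URLs already scheduled (assumes a single-process crawl)

def crawl(url):
    if url in seen:
        return                  # already fetched or queued; keeps the crawl finite
    seen.add(url)
    d = getPage(url)
    d.addCallback(handlePage)
    d.addErrback(lambda failure: None)  # swallow fetch errors in this sketch

def handlePage(output):
    for url in getAllUrls(output):
        crawl(url)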

You may want to look at Scrapy instead of building your own from scratch, though.
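For comparison, a minimal Scrapy spider that follows every link it finds looks roughly like this (the spider name and seed URL are placeholders; deduplication and scheduling come built in):

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"                          # hypothetical spider name
    start_urls = ["http://example.com/"]    # placeholder seed URL

    def parse(self, response):
        # Scrapy runs Twisted's reactor for you and deduplicates
        # requests by default.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Run it with scrapy runspider spider.py; Scrapy starts and stops the reactor itself.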



Source: https://stackoverflow.com/questions/10209229/web-crawler-using-twisted
