Scrape multiple URLs with Scrapy

Submitted by 强颜欢笑 on 2019-12-08 06:51:01

Question


How can I scrape multiple URLs with Scrapy?

Am I forced to write multiple crawlers?

class TravelSpider(BaseSpider):
    name = "speedy"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4),"http://example.com/superurl/top/page-%d/" % i for i in xrange(55)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = TravelItem()
        item['url'] = hxs.select('//a[@class="out"]/@href').extract()
        out = "\n".join(str(e) for e in item['url']);
        print out

Python says:

NameError: name 'i' is not defined

But when I use just one URL pattern it works fine!

   start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4)]

Answer 1:


Your Python syntax is incorrect; try:

start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4)] + \
    ["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)]

If you need to write code to generate start requests, you can define a start_requests() method instead of using start_urls.
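For instance, here is a minimal sketch of that approach, keeping the Python 2 / old-Scrapy style used elsewhere in this thread (scrapy.http.Request with the default parse callback); the URL patterns and page counts are the ones from the question:

from scrapy.http import Request
from scrapy.spider import BaseSpider


class TravelSpider(BaseSpider):
    name = "speedy"
    allowed_domains = ["example.com"]

    def start_requests(self):
        # Yield the initial requests directly instead of listing them in start_urls.
        # Each Request falls back to the default callback, self.parse.
        for i in xrange(4):
            yield Request("http://example.com/category/top/page-%d/" % i)
        for i in xrange(55):
            yield Request("http://example.com/superurl/top/page-%d/" % i)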




Answer 2:


You can initialize start_urls in the spider's __init__() method:

from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class TravelItem(Item):
    url = Field()


class TravelSpider(BaseSpider):
    name = "speedy"
    allowed_domains = ["example.com"]

    def __init__(self, name=None, **kwargs):
        # Build start_urls at instantiation time rather than in the class body.
        self.start_urls = []
        self.start_urls.extend(["http://example.com/category/top/page-%d/" % i for i in xrange(4)])
        self.start_urls.extend(["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)])

        super(TravelSpider, self).__init__(name, **kwargs)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = TravelItem()
        item['url'] = hxs.select('//a[@class="out"]/@href').extract()
        out = "\n".join(str(e) for e in item['url'])
        print out

Hope that helps.
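A possible extension, since the constructor already accepts **kwargs: the page counts could be exposed as spider arguments. This is only a hedged sketch; the category_pages and superurl_pages names are hypothetical and not part of the original answer.

class TravelSpider(BaseSpider):
    name = "speedy"
    allowed_domains = ["example.com"]

    def __init__(self, category_pages=4, superurl_pages=55, **kwargs):
        super(TravelSpider, self).__init__(**kwargs)
        # Spider arguments passed on the command line with -a arrive as strings,
        # so cast them before using them as range bounds.
        self.start_urls = (
            ["http://example.com/category/top/page-%d/" % i for i in xrange(int(category_pages))] +
            ["http://example.com/superurl/top/page-%d/" % i for i in xrange(int(superurl_pages))]
        )

That would let you run, for example, scrapy crawl speedy -a category_pages=10.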




Answer 3:


Python has only four scopes: Local, Enclosing, Global, and Built-in (LEGB). The local scope of a class body and the local scope of a list comprehension are not nested functions, so neither forms an enclosing scope for the other; they are two separate local scopes that cannot see each other's names.

So avoid mixing a comprehension's for clause with class-level variables, as illustrated below.
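For illustration, a minimal sketch of the pitfall being described (the Demo class and its attribute names are just illustrative). The NameError below appears under Python 3, where a comprehension body gets its own function-like scope; in Python 2 a list comprehension shares the surrounding scope.

class Demo(object):
    base = "http://example.com/category/top/page-%d/"
    # Under Python 3 this raises NameError: name 'base' is not defined,
    # because the comprehension body runs in its own scope and the class
    # body does not act as an enclosing scope for it.
    urls = [base % i for i in range(4)]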



Source: https://stackoverflow.com/questions/16103938/scrap-multiple-urls-with-scrapy
