Python Scrapy: Convert relative paths to absolute paths

离开以前 2020-11-30 06:12

I have amended the code based on the solutions offered below by the great folks here, and I now get the error shown below the code.

from scrapy.spider import BaseSp         


        
5 answers
  • 2020-11-30 06:19
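    from scrapy.utils.response import get_base_url
    from scrapy.utils.url import urljoin_rfc  # helper from older Scrapy releases

    base_url           = get_base_url(response)
    # extract() returns a list with every matching src attribute
    relative_url       = site.select('//*[@id="showImage"]/@src').extract()
    # join each relative URL against the page's base URL
    item['image_urls'] = [urljoin_rfc(base_url,ru) for ru in relative_url]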
    

    Or, to extract just one item:

    base_url           = get_base_url(response)
    relative_url       = site.select('//*[@id="showImage"]/@src').extract()[0]
    item['image_urls'] = urljoin_rfc(base_url,relative_url)
    

    The error occurred because you were passing a list instead of a str to the urljoin function.
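
To make the list-versus-string point concrete, here is a minimal sketch that uses the standard library's `urljoin` from Python 3 in place of the older `urljoin_rfc` helper. The base URL and the `src` values are made-up stand-ins for what `get_base_url(response)` and `extract()` would return, and `join_all` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def join_all(base_url, relative_urls):
    """Join every relative URL in a list against base_url."""
    return [urljoin(base_url, ru) for ru in relative_urls]


# Made-up values standing in for get_base_url(response) and extract()
base_url = "http://www.example.com/products/item.html"
relative_urls = ["images/front.jpg", "/static/back.jpg"]

# urljoin takes one string at a time, so the list is joined element by element
image_urls = join_all(base_url, relative_urls)
print(image_urls)
```

A path without a leading slash resolves relative to the directory of the base URL, while a path with a leading slash resolves from the site root.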

  • 2020-11-30 06:30

    What I do is:

    import urlparse
    ...
    
    def parse(self, response):
        ...
        urlparse.urljoin(response.url, extractedLink.strip())
        ...
    

    Note the strip(): I sometimes come across strange links like this:

    <a href="
                  /MID_BRAND_NEW!%c2%a0MID_70006_Google_Android_2.2_7%22%c2%a0Tablet_PC_Silver/a904326516.html
                ">MID BRAND NEW!&nbsp;MID 70006 Google Android 2.2 7"&nbsp;Tablet PC Silver</a>
    
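
The effect of `strip()` on a link like the one above can be sketched with the Python 3 standard library. The page URL and the padded `href` are made-up examples, and `clean_join` is a hypothetical helper name. How an unstripped value is treated depends on the Python version, so stripping first keeps the result predictable.

```python
from urllib.parse import urljoin


def clean_join(page_url, href):
    """Strip surrounding whitespace from href, then resolve it against page_url."""
    return urljoin(page_url, href.strip())


# Made-up page URL and an href padded with newlines and spaces
page_url = "http://www.example.com/category/index.html"
messy_href = "\n              /tablets/a904326516.html\n            "

absolute = clean_join(page_url, messy_href)
print(absolute)
```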
  • A more general approach to obtaining an absolute URL would be:

    import urlparse
    
    def abs_url(url, response):
      """Return absolute link"""
      base = response.xpath('//head/base/@href').extract()
      if base:
        base = base[0]
      else:
        base = response.url
      return urlparse.urljoin(base, url)
    

    This also works when a base element is present.

    In your case, you'd use it like this:

    def parse(self, response):
      # ...
      for site in sites:
        # ...
        image_urls = site.select('//*[@id="showImage"]/@src').extract()
        if image_urls:
          item['image_urls'] = abs_url(image_urls[0], response)
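
The `<base>` handling can also be tried without Scrapy. The sketch below is a standard-library illustration only: `resolve` is a hypothetical function, the URLs are made up, and `base_href` stands in for the value that the `//head/base/@href` query would return. Unlike `abs_url` above, it first resolves the base href against the page URL, on the assumption that a `<base>` element may itself hold a relative path.

```python
from urllib.parse import urljoin


def resolve(url, page_url, base_href=None):
    """Resolve url against the <base> href if one is given, else against page_url."""
    # A <base> href may itself be relative, so anchor it to the page URL first.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, url)


page_url = "http://www.example.com/shop/item.html"

# No <base> element: the page URL is the base
without_base = resolve("img/a.png", page_url)
# Absolute <base href>: it replaces the page URL as the base
with_base = resolve("img/a.png", page_url, base_href="http://cdn.example.com/assets/")
# Relative <base href>: resolved against the page URL before use
relative_base = resolve("img/a.png", page_url, base_href="/static/")

print(without_base, with_base, relative_base, sep="\n")
```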
    
  • 2020-11-30 06:39

    Several notes. You build a list of items and return it:

    items = []
    for site in sites:
        item = DmozItem()
        item['manufacturer'] = 'Namaste Foods'
        ...
        items.append(item)
    return items
    

    I do it differently, yielding each item as soon as it is built:

    for site in sites:
        item = DmozItem()
        item['manufacturer'] = 'Namaste Foods'
        ...
        yield item
    

    Then, in this part:

    relative_url = site.select('//*[@id="showImage"]/@src').extract()
    item['image_urls'] = urljoin_rfc(base_url, relative_url)
    

    extract() always returns a list, because an XPath query always returns a list of selected nodes, so relative_url here is a list and not a string.

    Take the first element instead:

    relative_url = site.select('//*[@id="showImage"]/@src').extract()[0]
    item['image_urls'] = urljoin_rfc(base_url, relative_url)
    
  • 2020-11-30 06:41

    From the Scrapy docs:

    def parse(self, response):
        # ... code omitted
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, self.parse)
    

    That is, the response object has a method that does exactly this.
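
For a page without a `<base>` element, `response.urljoin(url)` should give the same result as joining against `response.url` with the standard library's `urljoin`. Under that assumption, the sketch below shows how different kinds of relative link resolve. The page URL and the `href` values are made up for illustration.

```python
from urllib.parse import urljoin

# Made-up URL standing in for response.url
page_url = "http://www.example.com/catalog/list.html?page=1"

hrefs = [
    "?page=2",                   # query only: keeps the current path
    "page2.html",                # relative to the current directory
    "/list/2",                   # relative to the site root
    "../img/a.png",              # one directory up
    "//cdn.example.com/a.png",   # scheme-relative: keeps the current scheme
    "https://other.example/x",   # already absolute: returned unchanged
]

resolved = {href: urljoin(page_url, href) for href in hrefs}
for href, absolute in resolved.items():
    print(f"{href!r:30} -> {absolute}")
```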
