Python Scrapy: Convert relative paths to absolute paths

Posted by 余生颓废 on 2019-11-27 07:51:42

What I do is:

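from urllib.parse import urljoin  # Python 3; on Python 2: from urlparse import urljoin
...

def parse(self, response):
    ...
    # strip() removes whitespace around the extracted href before joining
    urljoin(response.url, extractedLink.strip())
    ...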

Note the strip() call: I sometimes run into odd links where the href value is surrounded by whitespace and newlines, like:

<a href="
              /MID_BRAND_NEW!%c2%a0MID_70006_Google_Android_2.2_7%22%c2%a0Tablet_PC_Silver/a904326516.html
            ">MID BRAND NEW!&nbsp;MID 70006 Google Android 2.2 7"&nbsp;Tablet PC Silver</a>
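
To illustrate why strip() helps, here is a minimal sketch using only the standard library. The page URL and the shortened href are made-up values, and absolutize is a hypothetical helper name, not part of Scrapy.

```python
from urllib.parse import urljoin

def absolutize(page_url, href):
    """Strip surrounding whitespace from href, then resolve it against page_url."""
    return urljoin(page_url, href.strip())

# Assumed example values: an href padded with newlines and spaces,
# as in the markup above.
page_url = "http://example.com/catalog/index.html"
href = "\n              /items/a904326516.html\n            "

print(absolutize(page_url, href))
```

Stripping first keeps the result independent of how a given Python version treats leading whitespace in URLs.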

From Scrapy docs:

def parse(self, response):
    # ... code omitted
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, self.parse)

That is, the response object has a urljoin() method that does exactly this.

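from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc  # deprecated in later releases

base_url           = get_base_url(response)
# extract() returns a list, so join each relative URL separately
relative_url       = site.select('//*[@id="showImage"]/@src').extract()
item['image_urls'] = [urljoin_rfc(base_url, ru) for ru in relative_url]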

Or you could extract just one item:

base_url           = get_base_url(response)
relative_url       = site.select('//*[@id="showImage"]/@src').extract()[0]
item['image_urls'] = urljoin_rfc(base_url,relative_url)

The error occurred because you were passing a list instead of a str to the urljoin function.
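
As a rough sketch of the two options above (join every extracted URL, or only the first), the following uses urljoin from the standard library instead of urljoin_rfc. The names join_all and join_first, and the sample URLs, are assumptions made for illustration.

```python
from urllib.parse import urljoin

def join_all(base_url, relative_urls):
    """Resolve each extracted URL against base_url; always returns a list."""
    return [urljoin(base_url, ru.strip()) for ru in relative_urls]

def join_first(base_url, relative_urls):
    """Resolve only the first extracted URL, or return None if nothing matched."""
    if not relative_urls:
        return None
    return urljoin(base_url, relative_urls[0].strip())

base_url = "http://example.com/products/page.html"    # assumed value
extracted = ["images/front.jpg", "/static/back.jpg"]  # stand-in for extract()

print(join_all(base_url, extracted))
print(join_first(base_url, extracted))
print(join_first(base_url, []))  # an empty result no longer raises IndexError
```

Checking for an empty list before indexing avoids the IndexError that a bare [0] raises when the XPath matches nothing.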

Several notes:

items = []
for site in sites:
    item = DmozItem()
    item['manufacturer'] = 'Namaste Foods'
    ...
    items.append(item)
return items

I do it differently:

for site in sites:
    item = DmozItem()
    item['manufacturer'] = 'Namaste Foods'
    ...
    yield item

Then, regarding these lines, which cause the error:

relative_url = site.select('//*[@id="showImage"]/@src').extract()
item['image_urls'] = urljoin_rfc(base_url, relative_url)

extract() always returns a list, because an XPath query always returns a list of selected nodes.

Do this:

relative_url = site.select('//*[@id="showImage"]/@src').extract()[0]
item['image_urls'] = urljoin_rfc(base_url, relative_url)

A more general approach to obtaining an absolute URL would be:

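from urllib.parse import urljoin  # Python 3; on Python 2: from urlparse import urljoin

def abs_url(url, response):
  """Return an absolute link, honoring a <base href> if the page has one."""
  base = response.xpath('//head/base/@href').extract()
  if base:
    base = base[0]
  else:
    base = response.url
  return urljoin(base, url)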

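This also works when a <base> element is present in the page's <head>, since relative links must then be resolved against its href rather than the page URL.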
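
To make the effect of a <base> element concrete, here is a small standard-library sketch. resolve_with_base is a hypothetical helper, and all URLs are invented for the example. As a small extension to abs_url, it also resolves a relative base href against the page URL first.

```python
from urllib.parse import urljoin

def resolve_with_base(page_url, base_href, link):
    """Resolve link against the <base href> if given, else against page_url.

    base_href may itself be relative, so it is first resolved against page_url.
    """
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, link)

page_url = "http://example.com/shop/item.html"  # assumed value

# No <base> element: the link is relative to the page itself.
print(resolve_with_base(page_url, None, "img/a.png"))
# Absolute <base href>: the link is relative to the base instead.
print(resolve_with_base(page_url, "http://cdn.example.com/assets/", "img/a.png"))
# Relative <base href>: resolved against the page URL before joining.
print(resolve_with_base(page_url, "/static/", "img/a.png"))
```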

In your case, you'd use it like this:

def parse(self, response):
  # ...
  for site in sites:
    # ...
    image_urls = site.select('//*[@id="showImage"]/@src').extract()
    if image_urls: item['image_urls'] = abs_url(image_urls[0], response)