I have amended the code based on solutions offered below by the great folks here; I get the error shown below the code here.
from scrapy.spider import BaseSp
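from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc

base_url = get_base_url(response)
# extract() returns a list, so join each relative URL separately
relative_urls = site.select('//*[@id="showImage"]/@src').extract()
item['image_urls'] = [urljoin_rfc(base_url, ru) for ru in relative_urls]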
Or you could extract just one item:
base_url = get_base_url(response)
relative_url = site.select('//*[@id="showImage"]/@src').extract()[0]
item['image_urls'] = urljoin_rfc(base_url,relative_url)
The error occurred because you were passing a list instead of a str to the urljoin function.
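To see the difference between the two calls, here is a minimal sketch. It assumes Python 3 and uses the standard library's `urllib.parse.urljoin` as a stand-in for `urljoin_rfc`; the base URL, the image paths, and both helper names are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical values standing in for get_base_url(response) and extract().
base_url = "http://example.com/products/"
relative_urls = ["images/a.jpg", "/static/b.jpg"]


def join_all(base, urls):
    """Join each relative URL against the base, one string at a time."""
    return [urljoin(base, u) for u in urls]


def join_list_directly(base, urls):
    """Pass the whole list to urljoin and report whether it is rejected."""
    try:
        urljoin(base, urls)
    except TypeError:
        return "TypeError"
    return "ok"


absolute_urls = join_all(base_url, relative_urls)
print(absolute_urls)
print(join_list_directly(base_url, relative_urls))
```

With the standard-library function, handing over the non-empty list is rejected with a TypeError, while joining element by element gives one absolute URL per extracted value.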
What I do is:
import urlparse
...
def parse(self, response):
    ...
    urlparse.urljoin(response.url, extractedLink.strip())
    ...
Notice the strip() call: I sometimes run into strange links like this one, where the href value is wrapped in line breaks:
<a href="
/MID_BRAND_NEW!%c2%a0MID_70006_Google_Android_2.2_7%22%c2%a0Tablet_PC_Silver/a904326516.html
">MID BRAND NEW! MID 70006 Google Android 2.2 7" Tablet PC Silver</a>
A more general approach to obtaining an absolute URL would be:
import urlparse

def abs_url(url, response):
    """Return absolute link"""
    base = response.xpath('//head/base/@href').extract()
    if base:
        base = base[0]
    else:
        base = response.url
    return urlparse.urljoin(base, url)
This also works when a base element is present.
In your case, you'd use it like this:
def parse(self, response):
    # ...
    for site in sites:
        # ...
        image_urls = site.select('//*[@id="showImage"]/@src').extract()
        if image_urls:
            item['image_urls'] = abs_url(image_urls[0], response)
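If you want to try the base-element logic without a running spider, something like the sketch below may help. It assumes Python 3's `urllib.parse.urljoin`, and the three-argument signature is my own simplification: `page_url` stands in for `response.url` and `base_href` for whatever the `//head/base/@href` query returned (or None). All URLs are invented.

```python
from urllib.parse import urljoin


def abs_url(url, page_url, base_href=None):
    """Resolve url against <base href> when given, else against the page URL."""
    # A <base href> may itself be relative, so resolve it against the page first.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, url)


page = "http://example.com/shop/item.html"

# No <base> element: the link is resolved against the page URL.
print(abs_url("img/a.png", page))

# With a <base> element: the link is resolved against the base instead.
print(abs_url("img/a.png", page, "http://cdn.example.com/assets/"))
```

Resolving the base against the page first is an extra step that the function above does not take; it only matters when the base href is itself relative.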
Several notes:
items = []
for site in sites:
    item = DmozItem()
    item['manufacturer'] = 'Namaste Foods'
    ...
    items.append(item)
return items
I do it differently:
for site in sites:
    item = DmozItem()
    item['manufacturer'] = 'Namaste Foods'
    ...
    yield item
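The two styles produce the same items; the difference is only in when they are handed back. A rough sketch in plain Python, with dicts standing in for DmozItem and made-up product names (both function names are hypothetical too):

```python
def parse_with_list(names):
    """Collect every item first, then return the whole list at once."""
    items = []
    for name in names:
        items.append({"manufacturer": "Namaste Foods", "name": name})
    return items


def parse_with_yield(names):
    """Hand each item back as soon as it is built."""
    for name in names:
        yield {"manufacturer": "Namaste Foods", "name": name}


# Invented product names, only for the demonstration.
names = ["Pizza Crust Mix", "Waffle Mix"]

as_list = parse_with_list(names)
as_generator = parse_with_yield(names)

# Consuming the generator gives the same items in the same order.
print(list(as_generator) == as_list)
```

The yield version avoids keeping the bookkeeping list around, which is mostly a matter of taste for small pages.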
Then:
relative_url = site.select('//*[@id="showImage"]/@src').extract()
item['image_urls'] = urljoin_rfc(base_url, relative_url)
extract() always returns a list, because an XPath query always returns a list of selected nodes.
Do this:
relative_url = site.select('//*[@id="showImage"]/@src').extract()[0]
item['image_urls'] = urljoin_rfc(base_url, relative_url)
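One caveat with indexing [0]: if the XPath matches nothing, the list is empty and the index raises IndexError. A small guard along these lines could cover that case. This is only a sketch, assuming Python 3 and the standard `urllib.parse.urljoin` in place of `urljoin_rfc`; `first_absolute` is a name I made up, and the URLs are invented.

```python
from urllib.parse import urljoin


def first_absolute(base_url, extracted):
    """Return the first extracted URL made absolute, or None if nothing matched."""
    if not extracted:
        # Empty list: the XPath selected no nodes, so there is nothing to join.
        return None
    # strip() removes stray whitespace around the attribute value.
    return urljoin(base_url, extracted[0].strip())


# Hypothetical inputs: one page with an image, one without.
print(first_absolute("http://example.com/p/", ["img/x.jpg"]))
print(first_absolute("http://example.com/p/", []))
```

The caller can then skip setting item['image_urls'] when the result is None.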
From the Scrapy docs:
def parse(self, response):
    # ... code omitted
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, self.parse)
That is, the response object has a method to do exactly this.