Question
I am trying to crawl a website using Scrapy, and the URLs of every page I want to scrape are all written using a relative path of this kind:
<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
<a href="../../en/item-to-scrap.html">Link</a>
Now, in my browser, these links work, and you get to URLs like https://www.domain-name.com/en/item-to-scrap.html (despite the relative path going back up twice in the hierarchy instead of once).
But my CrawlSpider does not manage to resolve these URLs into correct ones, and all I get are errors like this:
2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request
Is there a way to fix this, or am I missing something?
Here is my spider's code, which is fairly basic (item URLs match "/en/item-*-scrap.html"):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product
Answer 1:
Under the hood, Scrapy uses urlparse.urljoin (http://docs.python.org/2/library/urlparse.html#urlparse.urljoin) to build the next URL by joining the current URL with the scraped link. And if you join the URLs you gave as an example,
<!-- on page https://www.domain-name.com/en/somelist.html -->
<a href="../../en/item-to-scrap.html">Link</a>
the returned URL is the same as the one in the Scrapy error. Try this in a Python shell:
import urlparse
urlparse.urljoin("https://www.domain-name.com/en/somelist.html","../../en/item-to-scrap.html")
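# On Python 2 this returns the same malformed URL seen in the Scrapy error:
# 'https://www.domain-name.com/../en/item-to-scrap.html'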
The urljoin behaviour appears to be valid; see http://tools.ietf.org/html/rfc1808.html#section-5.2
If possible, can you share the site you are crawling?
With this understanding, the possible solutions are:
1) Manipulate the URLs generated in the crawl spider (strip the extra dot segments and slashes). Basically, override parse or _requests_to_follow.
Source of CrawlSpider: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py
2) Manipulate the URL in a downloader middleware; this might be cleaner. You remove the ../ in the process_request of the downloader middleware (see the sketch after this list).
Documentation for downloader middleware: http://scrapy.readthedocs.org/en/0.16/topics/downloader-middleware.html
3) Use BaseSpider and return the manipulated URL requests you want to crawl further.
Documentation for BaseSpider: http://scrapy.readthedocs.org/en/0.16/topics/spiders.html#basespider
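As a minimal sketch of option 2 (assuming a Scrapy 0.16-era project), a downloader middleware could rewrite the URL before the request is sent. The class name and the simple "../" stripping below are illustrative, and the middleware still has to be enabled in DOWNLOADER_MIDDLEWARES in settings.py:
# Illustrative downloader middleware: strip spurious "../" segments
# from outgoing request URLs (class name is hypothetical).
class FixRelativeUrlMiddleware(object):

    def process_request(self, request, spider):
        if "../" in request.url:
            # Don't mutate the request in place; request.replace()
            # returns a copy with the cleaned URL, which Scrapy reschedules.
            return request.replace(url=request.url.replace("../", ""))
        # Returning None lets the original request continue unchanged.
        return None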
Please let me know if you have any questions.
Answer 2:
I finally found a solution thanks to this answer. I used process_links as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), process_links='process_links', callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), process_links='process_links', follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product

    def process_links(self, links):
        # Strip the spurious "../" segments from every extracted link
        # before the requests are scheduled.
        for i, w in enumerate(links):
            w.url = w.url.replace("../", "")
            links[i] = w
        return links
Source: https://stackoverflow.com/questions/19769215/avoid-bad-requests-due-to-relative-urls