Question
I am using scrapy to check whether a website works fine when I use http://example.com, https://example.com or http://www.example.com. When I create a scrapy request, it works fine. For example, my page1.com is always redirected to https://. I need to get this information as a return value, or is there a better way to get this information using scrapy?
import scrapy

class myspider(scrapy.Spider):
    name = 'superspider'
    start_urls = [
        "https://page1.com/"
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        url = response.url
        # removing all possible prefixes from url
        for remove in ['https://', 'http://', 'www.']:
            url = str(url).replace(remove, '').rstrip('/')
        # Try with all possible prefixes
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(url='{}{}'.format(prefix, url), callback=self.test, dont_filter=True)

    def test(self, response):
        print(response.url, response.status)
The output of this spider is this:
https://page1.com 200
https://page1.com/ 200
https://page1.com/ 200
https://page1.com/ 200
This is nice, but I would like to get this information as a return value, so that I know that e.g. http returns response code 200, and then save it to a dictionary for later processing, or save it as JSON to a file (using items in scrapy).
DESIRED OUTPUT:
I would like to have a dictionary named a with all the information:
print(a)
{'https://': True, 'http://': True, 'https://www.': True, 'http://www.': True}
Later I would like to scrape more information, so I will need to store all the information under one object/json/...
Answer 1:
Instead of using the meta possibility pointed out by eLRuLL, you can compare request.url with response.url:
scrapy shell http://stackoverflow.com
In [1]: request.url
Out[1]: 'http://stackoverflow.com'
In [2]: response.url
Out[2]: 'https://stackoverflow.com/'
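Inside a running spider (rather than the shell), the originally requested URL can also be recovered from the redirect middleware's metadata. A minimal sketch, assuming the default RedirectMiddleware is enabled (the spider name and URL are just placeholders):

import scrapy

class redirectawarespider(scrapy.Spider):
    name = 'redirectaware'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        # RedirectMiddleware stores the chain of original URLs in meta['redirect_urls'];
        # if no redirect happened, the key is absent and response.url is the requested URL
        requested = response.meta.get('redirect_urls', [response.url])[0]
        yield {'requested': requested, 'final': response.url, 'status': response.status}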
To store the values from the different runs together in one dict/JSON, you can use an additional pipeline, as mentioned in https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter. So you would have something like:
class WriteAllRequests(object):
    def __init__(self):
        self.urldic = {}

    def process_item(self, item, spider):
        # collect the status of each prefix under its bare url
        self.urldic.setdefault(item['url'], {})[item['urlprefix']] = item['urlstatus']
        if len(self.urldic[item['url']]) == 4:
            # all four prefixes seen: write the combined data out
            # (this could also be passed to a standard pipeline with a higher number)
            writedata(self.urldic[item['url']])  # writedata is a placeholder for your own output function
            del self.urldic[item['url']]
        return item
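The pipeline above assumes the spider yields items carrying the bare url, the prefix that was tried and the resulting status code. A hypothetical item matching those keys could look like this (the field names are an assumption, not taken from the original answer):

import scrapy

class UrlCheckItem(scrapy.Item):
    # hypothetical item matching the keys used in the pipeline sketch above
    url = scrapy.Field()        # bare domain, e.g. 'page1.com'
    urlprefix = scrapy.Field()  # one of 'http://', 'http://www.', 'https://', 'https://www.'
    urlstatus = scrapy.Field()  # HTTP status code of the response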
You must additionally activate the pipeline in your project settings.
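A minimal sketch of that activation, assuming the pipeline lives in myproject/pipelines.py (the module path and the priority value are placeholders):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.WriteAllRequests': 300,
}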
Answer 2:
You are doing one extra request at the beginning of the spider; you could deal with all those domains directly in the start_requests method:
import scrapy

class myspider(scrapy.Spider):
    name = 'superspider'

    def start_requests(self):
        url = 'https://page1.com/'  # the site to check
        # removing all possible prefixes from url
        for remove in ['https://', 'http://', 'www.']:
            url = str(url).replace(remove, '').rstrip('/')
        # Try with all possible prefixes
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(
                url='{}{}'.format(prefix, url),
                callback=self.parse,
                dont_filter=True,
                meta={'prefix': prefix},
            )

    def parse(self, response):
        yield {response.meta['prefix']: True}
Note that I am using the meta request parameter to pass the information about which prefix was used on to the next callback method.
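To end up with a single dict like the a from the question, the per-prefix items can be collected afterwards, for example from a JSON feed export. A small sketch, assuming the spider was run with scrapy crawl superspider -o results.json:

import json

# merge the per-prefix items produced by the spider into one dict
with open('results.json') as f:
    items = json.load(f)        # e.g. [{'https://': True}, {'http://': True}, ...]

a = {}
for item in items:
    a.update(item)

print(a)  # {'http://': True, 'http://www.': True, 'https://': True, 'https://www.': True}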
Source: https://stackoverflow.com/questions/52104875/how-check-if-website-support-http-htts-and-www-prefix-with-scrapy