How to check if a website supports http, https and the www prefix with Scrapy


Question


I am using Scrapy to check whether a website works correctly when I use http://example.com, https://example.com or http://www.example.com. When I create the Scrapy request, it works fine; for example, my page1.com is always redirected to https://. I need to get this information as a return value. Or is there a better way to get this information using Scrapy?

import scrapy


class myspider(scrapy.Spider):
    name = 'superspider'

    start_urls = [
        "https://page1.com/"
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        url = response.url
        # removing all possible prefixes from url
        for remove in ['https://', 'http://', 'www.']:
            url = str(url).replace(remove, '').rstrip('/')

        # Try with all possible prefixes
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(url='{}{}'.format(prefix, url), callback=self.test, dont_filter=True)

    def test(self, response):
        print(response.url, response.status)

The output of this spider is this:

https://page1.com 200
https://page1.com/ 200
https://page1.com/ 200
https://page1.com/ 200

This is nice, but I would like to get this information as a return value, so that I know e.g. that over http the response code is 200, and then save it to a dictionary for later processing or save it as JSON to a file (using items in Scrapy).

DESIRED OUTPUT: I would like to have a dictionary named a with all the information:

print(a)
{'https://': True, 'http://': True, 'https://www.': True, 'http://www.': True}

Later I would like to scrape more information, so I will need to store all the information under one object/json/...
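
A minimal sketch of one way to build that dictionary by collecting results on the spider itself; the spider name statusspider, the domain attribute and the a dictionary are illustrative additions, not part of the question, and closed() is the hook Scrapy calls once the spider finishes:

import scrapy


class statusspider(scrapy.Spider):
    name = 'statusspider'
    domain = 'page1.com'  # the bare domain to probe (illustrative)

    def __init__(self, *args, **kwargs):
        super(statusspider, self).__init__(*args, **kwargs)
        self.a = {}  # the desired result dictionary

    def start_requests(self):
        # one request per prefix, tagging each with its prefix via meta
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(url=prefix + self.domain, callback=self.test,
                                 dont_filter=True, meta={'prefix': prefix})

    def test(self, response):
        # record whether this prefix answered with 200; prefixes that
        # fail entirely never reach the callback and stay absent from a
        self.a[response.meta['prefix']] = response.status == 200

    def closed(self, reason):
        # Scrapy calls this once when the spider finishes crawling
        self.logger.info('a: %s', self.a)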


Answer 1:


Instead of using the meta mechanism pointed out by eLRuLL, you can inspect request.url:

scrapy shell http://stackoverflow.com
In [1]: request.url
Out[1]: 'http://stackoverflow.com'

In [2]: response.url
Out[2]: 'https://stackoverflow.com/'
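
Inside a spider callback, response.request.url reflects the last request after any redirect, so the originally requested URL is better taken from the redirect_urls meta key that Scrapy's RedirectMiddleware fills in. A minimal callback sketch (the test name comes from the question's spider):

    def test(self, response):
        # RedirectMiddleware stores the original url chain here
        # whenever a redirect occurred
        redirects = response.meta.get('redirect_urls', [])
        requested = redirects[0] if redirects else response.url
        yield {
            'requested': requested,
            'final': response.url,
            'redirected': bool(redirects),
            'status': response.status,
        }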

To store the values from different requests together in one dict/JSON, you can use an additional item pipeline, as described at https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter. So you have something like:

class WriteAllRequests(object):
    def __init__(self):
        self.urldic = {}

    def process_item(self, item, spider):
        # group the status of each prefix under the bare url
        self.urldic.setdefault(item['url'], {})[item['urlprefix']] = item['urlstatus']
        if len(self.urldic[item['url']]) == 4:
            # all four prefixes seen: write the collected data (or pass
            # it on to a standard pipeline with a higher number)
            writedata(self.urldic[item['url']])  # writedata is your own persistence function
            del self.urldic[item['url']]
        return item

You must additionally activate the pipeline.
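
For example in settings.py (the dotted path assumes the class lives in pipelines.py of a project called myproject; adjust it to your layout):

ITEM_PIPELINES = {
    'myproject.pipelines.WriteAllRequests': 300,
}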




Answer 2:


You are making one extra request at the beginning of the spider; you could deal with all those prefixes directly in the start_requests method:

class myspider(scrapy.Spider):
    name = 'superspider'

    start_urls = ["https://page1.com/"]

    def start_requests(self):
        for url in self.start_urls:
            # removing all possible prefixes from url
            for remove in ['https://', 'http://', 'www.']:
                url = url.replace(remove, '').rstrip('/')

            # try with all possible prefixes right away
            for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
                yield scrapy.Request(
                    url='{}{}'.format(prefix, url),
                    callback=self.parse,
                    dont_filter=True,
                    meta={'prefix': prefix},
                )

    def parse(self, response):
        yield {response.meta['prefix']: True}

Note that I am using the request's meta parameter to pass to the callback method the information about which prefix was used.
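
With Scrapy's built-in feed exports, the dicts yielded by parse can then be written straight to a file, for example:

scrapy crawl superspider -o results.json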



Source: https://stackoverflow.com/questions/52104875/how-check-if-website-support-http-htts-and-www-prefix-with-scrapy
