DEBUG: Crawled (404) | 易学教程

问题

This is my code:

# -*- coding: utf-8 -*-
import scrapy


class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    allowed_domains = ['money.finance.sina.com.cn/mkt/']
    start_urls = ['http://money.finance.sina.com.cn/mkt//']

    def parse(self, response):
        contents=response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()
        print(contents)

And I have set an user-agent in setting.py.

Then I get an error:

2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://money.finance.sina.com.cn/robots.txt> (referer: None)
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.finance.sina.com.cn/mkt//> (referer: None)

So How can I eliminate this error?

回答1:

The http-statuscode 404 is received because Scrapy is checking the /robots.txt by default. In your case this site does not exist and so a 404 is received but that does not have any impact. In case you want to avoid checking the robots.txt you can set ROBOTSTXT_OBEY = False in the settings.py.

Then the website is accessed successfully (http-statuscode 200). No content is printed because based on your xpath-selection nothing is selected. You have to fix your xpath-selection.

If you want to test different xpath- or css-selections in order to figure how to get your desired content, you might want to use the interactive scrapy shell:
scrapy shell "http://money.finance.sina.com.cn/mkt/"

You can find an example of a scrapy shell session in the official Scrapy documentation here.

回答2:

Maybe your ip is banned by the website,also you can need to add some cookies to crawling the data that you needed.

来源：https://stackoverflow.com/questions/61451154/debug-crawled-404

标签

python

scrapy