Scrapy LinkExtractor ignores parameters after the # sign and thus will not follow the link

Submitted by 放荡痞女 on 2019-12-24 23:55:21

Question


I am trying to crawl a website with Scrapy where the pagination parameters come after the "#" sign. Scrapy appears to ignore everything after that character, so it only ever sees the first page.

e.g.:

http://www.rolex.de/de/watches/find-rolex.html#g=1&p=2

If you enter a question mark instead of the "#" manually, the site loads page 1:

http://www.rolex.de/de/watches/find-rolex.html?p=2

The Scrapy log tells me it only fetched the first page:

DEBUG: Crawled (200) <GET http://www.rolex.de/de/watches/datejust/m126334-0014.html> (referer: http://www.rolex.de/de/watches/find-rolex.html)

My crawler looks like this:

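from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Listing pages; the page number is carried in the part after '#'.
start_urls = [
    'http://www.rolex.de/de/watches/find-rolex.html#g=1',
    'http://www.rolex.de/de/watches/find-rolex.html#g=0&p=2',
    'http://www.rolex.de/de/watches/find-rolex.html#g=0&p=3',
]

rules = (
    # Watch detail pages: pass each one to parse_item.
    Rule(
        LinkExtractor(allow=[r'.*/de/watches/.*/m\d{3,}.*.\.html']),
        callback='parse_item',
    ),
    # Listing pages: follow them to discover further links.
    Rule(
        LinkExtractor(allow=[r'.*/de/watches/find-rolex(/.*)?\.html#g=1(&p=\d*)?$']),
        follow=True,
    ),
)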

How can I make Scrapy ignore the "#" inside the URL and visit the given URL?


Answer 1:


Scrapy performs HTTP requests. The part of a URL after '#' (the fragment) is not sent to the server as part of an HTTP request; it is only used on the client side, typically by JavaScript.
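To make this concrete, here is a minimal sketch using only Python's standard library (not Scrapy itself). It strips the fragment from each of the start URLs in the question; the helper name request_url is an assumption introduced for illustration.

```python
from urllib.parse import urldefrag

# The three start URLs from the question.
start_urls = [
    "http://www.rolex.de/de/watches/find-rolex.html#g=1",
    "http://www.rolex.de/de/watches/find-rolex.html#g=0&p=2",
    "http://www.rolex.de/de/watches/find-rolex.html#g=0&p=3",
]


def request_url(url):
    """Return the URL without its fragment (hypothetical helper name).

    The fragment is the part after '#'; an HTTP client does not send it.
    """
    return urldefrag(url).url


# Collect the distinct URLs that would actually be requested.
requested = {request_url(u) for u in start_urls}
print(requested)
```

The set contains a single entry, so from the server's point of view all three start URLs are the same page.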

As suggested in the comments, the site loads the data using AJAX.

Moreover, the AJAX request itself is not paginated: the site downloads the whole list of watches as JSON in a single request, and the pagination is then done client-side in JavaScript.

So you can use the Network tab of your browser's developer tools to find the request that fetches the JSON data, and send a similar request instead of requesting the HTML page.

Note, however, that you cannot use LinkExtractor for JSON data. You should simply parse the response with Python's json module and iterate over the URLs it contains.
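A rough sketch of that last step, under stated assumptions: the real shape of the site's JSON is not given here, so the payload below (a "watches" list whose items have a "url" key) is purely hypothetical, as are the names SAMPLE_JSON and extract_watch_urls and the second sample entry. Adjust the keys to whatever the Network tab actually shows.

```python
import json
from urllib.parse import urljoin

# Page the relative links are resolved against (taken from the question).
BASE_URL = "http://www.rolex.de/de/watches/find-rolex.html"

# HYPOTHETICAL payload: the real JSON structure must be read off the
# browser's Network tab. The second entry is a made-up placeholder.
SAMPLE_JSON = """
{"watches": [
  {"url": "/de/watches/datejust/m126334-0014.html"},
  {"url": "/de/watches/example/m000000-0000.html"}
]}
"""


def extract_watch_urls(json_text, base_url=BASE_URL):
    """Parse the JSON text and return absolute URLs of the watch pages.

    Assumes a top-level "watches" list whose items carry a "url" key.
    """
    data = json.loads(json_text)
    return [urljoin(base_url, item["url"]) for item in data["watches"]]


for url in extract_watch_urls(SAMPLE_JSON):
    print(url)
```

In a Scrapy callback, one would pass response.text to such a function and yield a scrapy.Request(url, callback=self.parse_item) for each URL, which takes over the job the LinkExtractor rule did for HTML pages.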



Source: https://stackoverflow.com/questions/54061112/scrapy-linkextractor-ignores-parameters-behind-the-sign-and-thus-will-not-foll
