Scrapy, hash tag on URLs

半阙折子戏 2020-12-21 05:10

I'm in the middle of a scraping project using Scrapy.

I realized that Scrapy strips everything from the hash sign (#) to the end of the URL.

Here's the output from the shell:

3 answers
  •  情书的邮戳
    2020-12-21 06:00


    This isn't something Scrapy itself can change. The portion of the URL following the hash is the fragment identifier, which is used by the client (Scrapy here, usually a browser) and is never sent to the server.
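
    To see the split between what goes to the server and what stays with the client, here is a minimal sketch using only Python's standard library. The helper name split_fragment is made up for illustration; urldefrag is the real urllib.parse function.

```python
from urllib.parse import urldefrag


def split_fragment(url):
    """Return (url_without_fragment, fragment) for the given URL.

    split_fragment is a hypothetical helper name; urldefrag does the work.
    """
    base, fragment = urldefrag(url)
    return base, fragment


# The first part is what an HTTP client actually requests;
# the second part stays on the client side.
base, fragment = split_fragment("http://twitter.com/#!/also")
print(base)
print(fragment)
```

    The first value printed is the URL that is requested, and the second is the fragment that only client-side code ever sees.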

    What probably happens when you fetch the page in a browser is that the page includes some JavaScript that looks at the fragment identifier and loads some additional data via AJAX and updates the page. You'll need to look at what the browser does and see if you can emulate it--developer tools like Firebug or the Chrome or Safari inspector make this easy.

    For example, if you navigate to http://twitter.com/also, you are redirected to http://twitter.com/#!/also. The actual URL loaded by the browser here is just http://twitter.com/, but that page then loads data (http://twitter.com/users/show_for_profile.json?screen_name=also) which is used to generate the page, and is, in this case, just JSON data you could parse yourself. You can see this happen using the Network Inspector in Chrome.
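
    As a rough sketch of emulating that in your own code, the function below turns a hash-bang URL into the JSON URL mentioned above. The function name profile_data_url is hypothetical, the "#!/<screen_name>" fragment layout is assumed from this one example, and the endpoint path is copied from this answer; I can't confirm it is still served, so check the Network Inspector for whatever your target site requests.

```python
from urllib.parse import urlencode, urlsplit


def profile_data_url(hashbang_url):
    """Build the JSON data URL for a 'http://host/#!/<screen_name>' URL.

    Hypothetical helper: the endpoint path comes from the example in this
    answer and may not match what the site serves today.
    """
    parts = urlsplit(hashbang_url)
    fragment = parts.fragment
    if not fragment.startswith("!/"):
        raise ValueError("expected a fragment of the form '#!/<screen_name>'")
    # Drop the leading "!/" and any trailing slash to get the screen name.
    screen_name = fragment[2:].strip("/")
    query = urlencode({"screen_name": screen_name})
    return f"{parts.scheme}://{parts.netloc}/users/show_for_profile.json?{query}"


print(profile_data_url("http://twitter.com/#!/also"))
```

    In a spider you would then yield a scrapy.Request for the resulting URL and parse the body in the callback with json.loads(response.text), rather than requesting the hash-bang URL itself.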
