Question
I'm scraping a website that loads product data from individual JSON files. I found the URLs of the JSON files by inspecting the network traffic.
The problem is this: when I follow the JSON URLs, most of them return a JSON result. But the JSON URLs of products that have special characters in them, e.g. é, return a null response. The data is of course shown in the browser, but I can't seem to get the JSON response directly.
Any tips?
(I'm trying to find a similar website that acts in the same way so I can post it here for example)
EDIT:
Here is an example
Product A url: https://www.boozebud.com/p/hopnationbrewingco/thedamned
WORKS: A's JSON url: https://www.boozebud.com/a/producturl/p/hopnationbrewingco/thedamned
Product B url: https://www.boozebud.com/p/àbloc/superprestigenaturalblondebeer
RETURNS NULL: B's JSON url: https://www.boozebud.com/a/producturl/p/àbloc/superprestigenaturalblondebeer
(Related to my previous unanswered question, "scrapy: dealing with special characters in url", which might need to be revised in light of this question.)
Answer 1:
The problem appears to be the request headers. The server seems very sensitive to at least the Content-Type header, which it presumably uses internally to decode the incoming URL.
Try sending the request like this (this is what the site's own JavaScript does):
from scrapy import Request
# inside a spider callback:
yield Request('https://www.boozebud.com/a/producturl/p/%C3%A0bloc/superprestigenaturalblondebeer',
              headers={"Content-Type": "application/json; charset=UTF-8"})
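The URL in the request above is the percent-encoded (UTF-8) form of product B's JSON URL. As a minimal sketch, here is one way to produce that form from the raw URL using only the standard library. The helper name percent_encode_url and the JSON_HEADERS constant are hypothetical, introduced for illustration. Scrapy may already escape such URLs on its own, so this only makes the encoding step explicit. The header is still what the answer identifies as the likely cause.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Header suggested in the answer; the constant name is an illustrative assumption.
JSON_HEADERS = {"Content-Type": "application/json; charset=UTF-8"}


def percent_encode_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the URL path as UTF-8 bytes."""
    parts = urlsplit(url)
    # safe="/%" keeps path separators and already-encoded sequences untouched,
    # so calling the function twice does not double-encode.
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


raw = "https://www.boozebud.com/a/producturl/p/àbloc/superprestigenaturalblondebeer"
print(percent_encode_url(raw))
# https://www.boozebud.com/a/producturl/p/%C3%A0bloc/superprestigenaturalblondebeer
```

In a spider, the result could then be passed as Request(percent_encode_url(raw), headers=JSON_HEADERS), mirroring the request shown in the answer.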
Source: https://stackoverflow.com/questions/47563095/json-url-sometimes-returns-a-null-response