Question
I'm using Scrapy to scrape the adidas site: http://www.adidas.com/us/men-shoes
But it always fails with this error:
User timeout caused connection failure: Getting http://www.adidas.com/us/men-shoes took longer than 180.0 seconds..
It retries 5 times and then fails completely.
I can access the URL in Chrome, but it doesn't work in Scrapy.
I've tried using custom user agents and emulating browser request headers, but it still doesn't work.
Below is my code:
import scrapy

class AdidasSpider(scrapy.Spider):
    name = "adidas"

    def start_requests(self):
        urls = ['http://www.adidas.com/us/men-shoes']
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-US,en;q=0.9",
            "Cache-Control": "max-age=0",
            "Connection": "keep-alive",
            "Host": "www.adidas.com",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        }
        for url in urls:
            yield scrapy.Request(url, self.parse, headers=headers)

    def parse(self, response):
        # Spiders must yield items (dict-like) or Requests, not raw bytes.
        yield {"body": response.body}
Scrapy log:
{'downloader/exception_count': 1,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
'downloader/request_bytes': 224,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2018, 1, 25, 10, 59, 35, 57000),
'log_count/DEBUG': 2,
'log_count/INFO': 9,
'retry/count': 1,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 1, 25, 10, 58, 39, 550000)}
Update
After looking at the request headers in Fiddler and doing some tests, I found what was causing the issue. Scrapy sends a Connection: close header by default, and because of it I don't get any response from the adidas site.
After making the same request in Fiddler but without the Connection: close header, I got the response correctly. Now the problem is: how do I remove the Connection: close header?
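(For illustration, not from the original post: the obvious first attempt is a downloader middleware that rewrites the header, sketched below with a hypothetical project path myproject.middlewares. Per the accepted answer, though, the header is effectively hardcoded in Scrapy's HTTP client, so a middleware like this doesn't actually fix it; that is why Splash is used below.)

# settings.py -- register the (hypothetical) middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.KeepAliveMiddleware': 543,
}

# middlewares.py
class KeepAliveMiddleware:
    """Try to force a keep-alive Connection header on every request."""
    def process_request(self, request, spider):
        request.headers['Connection'] = 'keep-alive'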
Answer 1:
Since Scrapy doesn't let you edit the Connection: close header, I used scrapy-splash instead to make the requests through Splash. Now the Connection: close header can be overridden, and everything works. The downside is that the page now has to load all of its assets before Splash returns a response: slower, but it works.
Scrapy should add an option to edit its default Connection: close header. It is hardcoded in the library and cannot easily be overridden.
Below is my working code:
from scrapy_splash import SplashRequest

# (class attribute on the spider)
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Host": "www.adidas.com",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}

def start_requests(self):
    url = "http://www.adidas.com/us/men-shoes?sz=120&start=0"
    yield SplashRequest(url, self.parse, headers=self.headers)
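(Not part of the original answer: scrapy-splash also needs a running Splash instance and a few project settings. A minimal sketch following the scrapy-splash README, assuming Splash listens on localhost:8050:)

# settings.py -- minimal scrapy-splash wiring
# (assumes Splash is running, e.g. docker run -p 8050:8050 scrapinghub/splash)
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'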
Answer 2:
Well, at the very least you should use the headers you wrote by passing headers=headers to your scrapy.Request. However, it still didn't work even after I tried yield scrapy.Request(url, self.parse, headers=headers).
So next I set the User-Agent in settings.py to the one from your headers, i.e. "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36", and didn't pass the headers you wrote to scrapy.Request. That worked.
Maybe there is something wrong in the headers, but I'm pretty sure it's not about cookies.
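(For concreteness, a sketch of the settings.py change this answer describes:)

# settings.py -- project-wide default User-Agent, instead of per-request headers
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/63.0.3239.132 Safari/537.36")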
Answer 3:
I tried accessing the site using curl, and the connection hangs.
curl -v -L http://www.adidas.com/us/men-shoes
So I jumped into the browser's debugger and noticed there was a Cookie header in the request. I then copied the entire value from that header and pasted it into a curl -H option:
curl -v -L -H 'Cookie:<cookie value here>' http://www.adidas.com/us/men-shoes
Now the HTML content returns. So the site, at some point, sets cookies that are required to access the remainder of the site. Unfortunately, I'm not sure where or how to acquire the cookies programmatically. Let us know if you figure it out. Hope this helps.
Update
Looks like there are ways to use persistent session data (i.e. cookies) in Scrapy (I've never had to use it till this point :)). Take a look at this answer and this doc. I thought maybe the site was redirecting requests to set the cookie, but it's not. So it should be a relatively simple problem to fix.
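(A sketch, not from the original answer: once you have a cookie value from the browser, Scrapy can send it via the cookies= argument of Request. The cookie name and value below are placeholders:)

import scrapy

class AdidasSpider(scrapy.Spider):
    name = "adidas"

    def start_requests(self):
        # Cookie copied from the browser's debugger; Scrapy takes a dict
        # of name -> value (placeholder values here, not real adidas cookies).
        cookies = {"some_required_cookie": "value-from-browser"}
        yield scrapy.Request(
            'http://www.adidas.com/us/men-shoes',
            callback=self.parse,
            cookies=cookies,
        )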
Answer 4:
Using your code, the first connection works just fine for me: it uses the headers you give and gets the correct response. I modified your parse method to follow the product links and print the content of the <title> tags from the received pages, and that worked fine too. A sample log and printout are below. I suspect you're being slowed down because of excessive requests.
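(The answerer didn't post the modified parse method; a minimal sketch of what it might look like, using Scrapy's response.follow and CSS selectors:)

def parse(self, response):
    # Print the <title> of the page we just received.
    title = response.css('title::text').get()
    if title:
        print(title.strip())
    # Follow product detail links and parse them the same way.
    for href in response.css('a[href$=".html"]::attr(href)').getall():
        yield response.follow(href, callback=self.parse)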
2018-01-27 16:48:23 [scrapy.core.engine] INFO: Spider opened
2018-01-27 16:48:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-27 16:48:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-01-27 16:48:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.adidas.com/us/men-shoes> (referer: None)
2018-01-27 16:48:25 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.adidas.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2018-01-27 16:48:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.adidas.com/us/alphabounce-beyond-shoes/DB1126.html> from <GET http://www.adidas.com/us/alphabounce-beyond-shoes/DB1126.html>
2018-01-27 16:48:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.adidas.com/us/ultraboost-laceless-shoes/BB6137.html> from <GET http://www.adidas.com/us/ultraboost-laceless-shoes/BB6137.html>
<snipped a bunch>
2018-01-27 16:48:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adidas.com/us/> (referer: http://www.adidas.com/us/men-shoes)
2018-01-27 16:48:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adidas.com/us/nmd_cs2-primeknit-shoes/BY3012.html> (referer: http://www.adidas.com/us/men-shoes)
adidas Alphabounce Beyond Shoes - White | adidas US
adidas UA&SONS NMD R2 Shoes - Grey | adidas US
adidas NMD_C2 Shoes - Brown | adidas US
adidas NMD_CS2 Primeknit Shoes - Grey | adidas US
adidas NMD_Racer Primeknit Shoes - Black | adidas US
adidas Official Website | adidas US
adidas NMD_CS2 Primeknit Shoes - Black | adidas US
2018-01-27 16:48:26 [scrapy.core.engine] INFO: Closing spider (finished)
2018-01-27 16:48:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
Answer 5:
You could use the tool https://curl.trillworks.com/ to:
- Get a curl command from Chrome ("Copy as cURL" in DevTools)
- Run the converted Python code (I got a 200 response from your URL via Requests)
- Copy the headers and cookies into your scrapy.Request
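(A representative sketch of what the converted code looks like, not the tool's exact output; the cookie name and value are placeholders:)

import requests

# Headers and cookies exported from Chrome's "Copy as cURL", converted
# by curl.trillworks.com (placeholder cookie, not a real adidas cookie).
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}
cookies = {'some_cookie': 'value-from-browser'}

response = requests.get('http://www.adidas.com/us/men-shoes',
                        headers=headers, cookies=cookies)
print(response.status_code)  # the answerer reports a 200 here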
Source: https://stackoverflow.com/questions/48441336/scrapy-erroruser-timeout-caused-connection-failure