Read Time out when attempting to request a page

Question

I am scraping websites and sometimes get the error below. It concerns me because it appears at random, and when I retry the request the error does not occur.

requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.somewebsite.com', port=443): Read timed out. (read timeout=None)

My code looks like the following:

import time

import requests
from bs4 import BeautifulSoup
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem

software_names = [SoftwareName.CHROME.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value]
user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=100)
pages_to_scrape = ['https://www.somewebsite1.com/page', 'https://www.somewebsite2.com/page242']

for page in pages_to_scrape:
  time.sleep(2)  # pause between requests
  # Send each request with a randomly chosen User-Agent.
  response = requests.get(page, headers={'User-Agent': user_agent_rotator.get_random_user_agent()})
  soup = BeautifulSoup(response.content, "html.parser")
  # scrape info

As you can see from my code, I even use time.sleep to pause the script for a couple of seconds before requesting another page. I also use a random User-Agent. I am not sure whether I can do anything else to make sure I never get the read timeout error.

I also came across this, but it seems to suggest adding extra values to the headers, and I am not sure that is a generic solution because the values may have to differ from website to website. I also read in another SO post that we should base64 the request and retry. That went over my head: I have no idea how to do it, and the poster did not provide an example.

Any advice from those who have experience in scraping would be highly appreciated.

Answer 1:

Well, I've verified your issue. Basically, that site is using the AkamaiGHost firewall, as the response headers show:

curl -s -o /dev/null -D - https://www.uniqlo.com/us/en/men/t-shirts

The firewall will block your requests if they lack a valid User-Agent. The User-Agent should be stable; you don't need to change it on each request. You will also need to use requests.Session() to persist the session so that the TCP layer does not drop the packets. I was able to send 1k requests within a second and didn't get blocked. I even checked whether the bootstrap would block the request if I parsed the HTML source, and it didn't at all.

Note that I ran all my tests using Google DNS, which never adds latency to my threads; such latency can lead the firewall to drop the requests and classify them as a DDoS attack. One more point: DO NOT USE timeout=None, as that causes the request to wait forever for a response. On the back end, the firewall automatically detects any TCP listener that is in a pending state, drops it, and blocks the origin IP, which is you. That is based on a configured time :)

import requests
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup


def Test(num):
    print(f"Thread# {num}")
    # Send the request through a Session, as recommended above.
    with requests.Session() as req:
        # One fixed, valid User-Agent; no need to rotate it per request.
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
        r = req.get(
            "https://www.uniqlo.com/us/en/men/t-shirts", headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        if r.status_code == 200:
            return soup.title.text
        else:
            return f"Thread# {num} Failed"


# Run 30 requests across 20 worker threads and print each result in order.
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = executor.map(Test, range(1, 31))
    for future in futures:
        print(future)
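
The advice above (one session, one stable User-Agent, never timeout=None) could also be sketched for a sequential scraper roughly as follows. This is only an illustration, not the answerer's code: the helper names build_session and fetch, the timeout values, the retry count, and the list of retried status codes are all assumptions to tune per site.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Assumed values for illustration; tune them for the site you are scraping.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0"
TIMEOUT = (5, 30)  # (connect timeout, read timeout) in seconds


def build_session(total_retries=3, backoff_factor=1.0):
    """Return a Session with a fixed User-Agent and automatic retries."""
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    # Retry failed attempts and the listed status codes, backing off each time.
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def fetch(session, url):
    """GET a URL with an explicit timeout; raise on 4xx/5xx responses."""
    response = session.get(url, timeout=TIMEOUT)
    response.raise_for_status()
    return response
```

In the question's loop, this would mean calling build_session() once before the loop and fetch(session, page) inside it, so that every request shares the same session and has a finite timeout.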


Answer 2:

ReadTimeout exceptions are commonly caused by the following:

Making too many requests in a given time period
Making too many requests at the same time
Using too much bandwidth, either on your end or theirs

It looks like you are making 1 request every 2 seconds. For some websites this is fine; others could call this a denial-of-service attack. Google, for example, will slow down or block requests that occur too frequently.

Some sites will also limit the requests if you don't provide the right information in the headers, or if they think you're a bot.

To solve this try the following:

Increase the time between requests. For Google, 30-45 seconds works for me if I am not using an API.
Decrease the number of concurrent requests.
Have a look at the network requests that occur when you visit the site in your browser, and try to mimic them.
Use a package like selenium to make your activity look less like a bot.
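
The first suggestion (waiting longer between requests) can be combined with retrying, so that the wait grows after each failure. Below is a minimal sketch using only the standard library; the names backoff_delay and fetch_with_backoff, the delays, the jitter amount, and the attempt count are assumptions for illustration, not values recommended by any particular site.

```python
import random
import time


def backoff_delay(attempt, base=2.0, cap=60.0):
    """Delay in seconds before retry number `attempt` (0-based), capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def fetch_with_backoff(fetch, url, retry_on=(TimeoutError,), max_attempts=4,
                       base=2.0, cap=60.0, sleep=time.sleep):
    """Call fetch(url); on a retryable error wait longer each time, then retry."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            delay = backoff_delay(attempt, base, cap)
            # Add up to 10% random jitter so retries are not perfectly periodic.
            sleep(delay + random.uniform(0, delay * 0.1))
```

With requests, one would pass retry_on=(requests.exceptions.ReadTimeout,) and a fetch function that calls session.get(url, timeout=...) with a finite timeout.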

Source: https://stackoverflow.com/questions/60481347/read-time-out-when-attempting-to-request-a-page

Tags

python

web-scraping

beautifulsoup

python-requests