Web scraping with Python using BeautifulSoup 429 error


Question


First, I have to say that I'm quite new to web scraping with Python. I'm trying to scrape data using these lines of code:

import requests
from bs4 import BeautifulSoup

baseurl = 'https://name_of_the_website.com'

# Fetch the page and parse the returned HTML
html_page = requests.get(baseurl).text
soup = BeautifulSoup(html_page, 'html.parser')
print(soup)

As output I do not get the expected HTML page but another page that says: "Misbehaving Content Scraper. Please use robots.txt. Your IP has been rate limited."

To check the problem, I wrote:

try:
    page_response = requests.get(baseurl, timeout=5)
    if page_response.status_code == 200:
        # Reuse the body of the response we already have instead of
        # issuing a second request
        html_page = page_response.text
        soup = BeautifulSoup(html_page, 'html.parser')
    else:
        print(page_response.status_code)
except requests.Timeout as e:
    print(str(e))

Then I get 429 (Too Many Requests).

What can I do to handle this problem? Does it mean I cannot print the HTML of the page, and does it prevent me from scraping any content of the page? Should I rotate the IP address?


Answer 1:


If you are only hitting the page once and getting a 429, it's probably not that you are hitting them too much. You can't be sure the 429 error is accurate; it's simply what their webserver returned. I've seen pages return a 404 response code when the page was fine, and a 200 response code on genuinely missing pages, just a misconfigured server. They may simply return 429 to any bot. Try changing your User-Agent to Firefox, Chrome, or "Robot Web Scraper 9000" and see what you get. Like this:

requests.get(baseurl, headers={'User-Agent': 'Super Bot 9000'})

to declare yourself as a bot, or

requests.get(baseurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})

if you wish to mimic a browser more closely. Some sites return cleaner, more searchable markup if you just say you are a bot; for others it's the opposite. It's basically the wild wild west, so you have to just try different things.
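For example, here is a minimal sketch that tries a few User-Agent strings and reports the status code each one gets back (baseurl is the question's placeholder URL, and the loop is just one way to experiment):

import requests

baseurl = 'https://name_of_the_website.com'  # placeholder from the question

# Candidate User-Agent strings to experiment with
user_agents = [
    'Super Bot 9000',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
]

for ua in user_agents:
    response = requests.get(baseurl, headers={'User-Agent': ua}, timeout=5)
    print(ua[:40], '->', response.status_code)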

If you are indeed hitting them a lot, you need to sleep between calls; the response is completely controlled by the server side.
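A common pattern is a polite retry loop that waits between attempts and backs off when a 429 comes back, honoring the Retry-After header when the server sends one. A rough sketch (the retry counts and delays here are arbitrary choices, not anything the site documents):

import time
import requests

baseurl = 'https://name_of_the_website.com'  # placeholder from the question

def polite_get(url, max_retries=3, base_delay=5):
    """Fetch url, backing off whenever the server answers 429."""
    for attempt in range(max_retries):
        response = requests.get(url, timeout=5)
        if response.status_code != 429:
            return response
        # Retry-After is usually a number of seconds; it can also be an
        # HTTP date, which this sketch does not handle
        wait = int(response.headers.get('Retry-After', base_delay * 2 ** attempt))
        time.sleep(wait)
    return response

page = polite_get(baseurl)
print(page.status_code)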

You will also want to investigate how your code interacts with robots.txt; that's a file, usually at the root of the webserver, with the rules it would like your spider to follow. You can read about that here: Parsing Robots.txt in python
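The standard library can do that parsing for you. A minimal sketch using urllib.robotparser (the path '/some/page' is just an example):

from urllib import robotparser

baseurl = 'https://name_of_the_website.com'  # placeholder from the question

rp = robotparser.RobotFileParser()
rp.set_url(baseurl + '/robots.txt')
rp.read()

# Check whether a given User-Agent may fetch a given path
print(rp.can_fetch('Super Bot 9000', baseurl + '/some/page'))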

Spidering the web is fun and challenging; just remember that you could be blocked at any time by any site, for any reason. You are their guest, so tread nicely :)



Source: https://stackoverflow.com/questions/51638468/web-scraping-with-python-using-beautifulsoup-429-error
