Web scraping with Python using BeautifulSoup 429 error

北荒 2020-12-21 20:59

First, I have to say that I'm quite new to web scraping with Python. I'm trying to scrape data using these lines of code:

import requests
from bs4 import BeautifulSoup


        
1 Answer
  • 2020-12-21 21:13

    If you are only hitting the page once and getting a 429, it's probably not that you're hitting them too much. You can't be sure the 429 error is accurate; it's simply what their webserver returned. I've seen pages return a 404 response code when the page was fine, and a 200 response code on legitimately missing pages, just because of a misconfigured server. They may simply return 429 to any bot, so try changing your User-Agent to Firefox, Chrome, or "Robot Web Scraper 9000" and see what you get. Like this:

    requests.get(baseurl, headers = {'User-agent': 'Super Bot Power Level Over 9000'})
    

    to declare yourself as a bot or

    requests.get(baseurl, headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})
    

    if you wish to mimic a browser more closely. Note that all the version numbers in the browser-mimicking string were current at the time of this writing; you may need later versions. Just use the user agent string of the browser you actually use, and this page will tell you what that is:

    https://www.whatismybrowser.com/detect/what-is-my-user-agent

    Some sites return cleaner, more searchable markup if you just say you are a bot; for others it's the opposite. It's basically the wild wild west, so you just have to try different things.
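
    As a rough illustration of that trial-and-error, here is a minimal sketch (the URL and the User-Agent strings are just placeholders for whatever you are scraping) that tries a few different headers and prints the status code each one gets back:

    import requests

    baseurl = 'https://example.com/page'  # placeholder; use the page you are scraping

    # A few candidate User-Agent strings, from "honest bot" to "looks like a browser"
    user_agents = [
        'Super Bot Power Level Over 9000',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    ]

    for ua in user_agents:
        response = requests.get(baseurl, headers={'User-agent': ua})
        # 200 means that header got through; 429 means they are still throttling you
        print(ua[:40], '->', response.status_code)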

    Another pro tip: you may have to write your code to have a 'cookie jar', or some other way to accept a cookie. Usually it is just an extra line in your request, but I'll leave that for another stackoverflow question :)
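
    If you don't want to wait for that other question, here is a minimal sketch of the cookie-jar idea, assuming a placeholder URL: requests.Session keeps whatever cookies the site sets and sends them back on later requests.

    import requests

    session = requests.Session()  # the Session object holds cookies between requests
    session.headers.update({'User-agent': 'Super Bot Power Level Over 9000'})

    # First request: the server may set a cookie (a session id, consent flag, etc.)
    first = session.get('https://example.com/page')  # placeholder URL
    print(first.status_code, session.cookies.get_dict())

    # Later requests reuse the same cookie jar, so the site sees a returning client
    second = session.get('https://example.com/page')
    print(second.status_code)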

    If you are indeed hitting them a lot, you need to sleep between calls; the 429 is a server-side response completely controlled by them.
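
    Here is a minimal sketch of that kind of throttling, with placeholder URLs and an arbitrary delay; it also checks the Retry-After header, which some servers send with a 429 (this assumes the header holds a number of seconds, not a date):

    import time
    import requests

    headers = {'User-agent': 'Super Bot Power Level Over 9000'}
    urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholders

    for url in urls:
        response = requests.get(url, headers=headers)
        if response.status_code == 429:
            # Back off for as long as the server asks, or 30 seconds if it doesn't say
            wait = int(response.headers.get('Retry-After', 30))
            time.sleep(wait)
            response = requests.get(url, headers=headers)
        # Be polite even on success: a small pause between calls
        time.sleep(2)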

    You will also want to investigate how your code interacts with robots.txt, a file usually at the root of the webserver with the rules it would like your spider to follow. You can read about that here: Parsing Robots.txt in python
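
    If you want something quick before reading that, here is a minimal sketch using the standard library's urllib.robotparser (the site URL and user agent are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url('https://example.com/robots.txt')  # placeholder site
    rp.read()  # downloads and parses the robots.txt file

    # can_fetch() says whether this user agent is allowed to request that path
    if rp.can_fetch('Super Bot Power Level Over 9000', 'https://example.com/page'):
        print('allowed to crawl this page')
    else:
        print('robots.txt asks us not to crawl this page')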

    Spidering the web is fun and challenging; just remember that you could be blocked at any time, by any site, for any reason. You are their guest, so tread nicely :)
