How to accelerate web scraping using the combination of requests and BeautifulSoup in Python?

Question:


The objective is to scrape multiple pages using BeautifulSoup, where the HTML for each page is fetched with requests.get.

The steps are:

First, load the HTML using requests:

page = requests.get('https://oatd.org/oatd/' + url_to_pass)

Then, scrape the HTML content using the function defined below:

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

Say we have a hundred unique URLs to scrape, e.g. ['record?record=handle:11012%2F16478&q=eeg'] * 100; the whole process can be completed via the code below:

import requests
from bs4 import BeautifulSoup as Soup

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

list_of_url = ['record?record=handle:11012%2F16478&q=eeg'] * 100  # In practice, there will be 100 different unique sub-hrefs. But for illustration purposes, we purposely duplicate the URL
all_website_scrape = []
for url_to_pass in list_of_url:
    page = requests.get('https://oatd.org/oatd/' + url_to_pass)
    if page.status_code == 200:
        all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))

However, each URL is requested and scraped one at a time, which is inherently time consuming.

Is there another way, that I am not aware of, to increase the performance of the above code?


Answer 1:


realpython.com has a nice article about speeding up Python scripts with concurrency:

https://realpython.com/python-concurrency/

Using their example for threading, you can set the number of workers to execute multiple threads, which increases the number of requests you can make at once.

    from bs4 import BeautifulSoup as Soup
    import concurrent.futures
    import requests
    import threading
    import time
    
    def get_each_page(page_soup):
        return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                    paper_title=page_soup.find(attrs={"itemprop": "name"}).text)
    
    def get_session():
        if not hasattr(thread_local, "session"):
            thread_local.session = requests.Session()
        return thread_local.session
    
    def download_site(url_to_pass):
        session = get_session()
        page = session.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
        print(f"{page.status_code}: {page.reason}")
        if page.status_code == 200:
            all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))
    
    def download_all_sites(sites):
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            executor.map(download_site, sites)
    
    if __name__ == "__main__":
        list_of_url = ['record?record=handle:11012%2F16478&q=eeg'] * 100  # In practice, there will be 100 different unique sub-hrefs. But for illustration purposes, we purposely duplicate the URL
        all_website_scrape = []
        thread_local = threading.local()
        start_time = time.time()
        download_all_sites(list_of_url)
        duration = time.time() - start_time
        print(f"Downloaded {len(all_website_scrape)} in {duration} seconds")



Answer 2:


You could use the threading module to make the script multi-threaded and go much faster: https://docs.python.org/3/library/threading.html
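
A minimal sketch of that approach, reusing the get_each_page helper and URL prefix from the question (the one-thread-per-URL layout and the lock are assumptions for illustration, not the answerer's code):

import threading
import requests
from bs4 import BeautifulSoup as Soup

def get_each_page(page_soup):  # same helper as in the question
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

results = []
results_lock = threading.Lock()  # guard the shared list across threads

def worker(url_to_pass):
    page = requests.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
    if page.status_code == 200:
        scraped = get_each_page(Soup(page.text, 'html.parser'))
        with results_lock:
            results.append(scraped)

list_of_url = ['record?record=handle:11012%2F16478&q=eeg'] * 100
# One thread per URL; for large lists, a bounded pool such as the
# ThreadPoolExecutor in answer 1 is usually the better choice.
threads = [threading.Thread(target=worker, args=(url,)) for url in list_of_url]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(f"Downloaded {len(results)} pages")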

But if you are willing to change your approach, I'd recommend Scrapy.
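
Roughly, a minimal Scrapy spider for the same task might look like this (a sketch; the spider name and CSS selectors are assumptions modelled on the question's itemprop lookups):

import scrapy

class OatdSpider(scrapy.Spider):
    name = 'oatd'  # hypothetical spider name
    # In practice these would be the 100 unique record URLs from the question
    start_urls = ['https://oatd.org/oatd/record?record=handle:11012%2F16478&q=eeg']

    def parse(self, response):
        # Mirrors get_each_page from the question, expressed as CSS selectors
        yield {
            'paper_author': response.css('[itemprop="name"]::text').get(),
            'paper_title': response.css('[itemprop="name"]::text').get(),
        }

Saved as e.g. oatd_spider.py, it can be run with scrapy runspider oatd_spider.py -o results.json; Scrapy issues the requests concurrently, and the parallelism is tunable via settings such as CONCURRENT_REQUESTS.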



Source: https://stackoverflow.com/questions/62859338/how-to-accelerate-webscraping-using-the-combination-of-request-and-beautifulsoup
