Scrape Multiple URLs using Beautiful Soup

Posted by 那年仲夏 on 2019-12-31 17:59:31

Question


I'm trying to extract elements with specific classes from multiple URLs. The tags and classes stay the same across pages, but I need my Python program to scrape all of them as I input the links.

Here's a sample of my work:

from bs4 import BeautifulSoup
import requests

url = input('insert URL here: ')
#scrape elements
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

#print titles only
h1 = soup.find("h1", class_="class-headline")
print(h1.get_text())

This works for an individual URL but not for a batch. Thanks for helping me; I've learned a lot from this community.


Answer 1:


Keep a list of URLs and iterate through it:

from bs4 import BeautifulSoup
import requests

# requests needs a scheme (http:// or https://) on every URL
urls = ['https://www.website1.com', 'https://www.website2.com', 'https://www.website3.com']  # ...

#scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    #print titles only
    h1 = soup.find("h1", class_="class-headline")
    print(h1.get_text())
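Note that requests.get() raises on network failures, and find() returns None when a page has no matching h1, in which case h1.get_text() crashes. Here is a minimal, more defensive sketch of the same loop; the class name "class-headline" comes from the question, while the timeout value and the example URLs are illustrative assumptions:

from bs4 import BeautifulSoup
import requests

def print_headline(url):
    # Fetch the page; raise_for_status() turns HTTP errors (404, 500, ...) into exceptions
    try:
        response = requests.get(url, timeout=10)  # timeout is an assumed safeguard
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"skipping {url}: {exc}")
        return

    soup = BeautifulSoup(response.content, "html.parser")
    h1 = soup.find("h1", class_="class-headline")
    if h1 is None:                      # page has no matching headline
        print(f"no headline found on {url}")
    else:
        print(h1.get_text(strip=True))  # strip=True trims surrounding whitespace

for url in ['https://www.website1.com', 'https://www.website2.com']:  # ...
    print_headline(url)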

If you want to prompt the user for each URL instead, it can be done this way:

from bs4 import BeautifulSoup
import requests

msg = 'Enter URL, or type q and hit enter to exit: '
url = input(msg)
while url != 'q':
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    #print titles only
    h1 = soup.find("h1", class_="class-headline")
    print(h1.get_text())
    url = input(msg)  # re-prompt for the next URL
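As an aside, the same prompt-until-sentinel loop can be written with Python's two-argument iter(), which calls a function repeatedly until it returns the sentinel value. A minimal sketch, assuming the same class name as above:

from bs4 import BeautifulSoup
import requests

msg = 'Enter URL, or type q and hit enter to exit: '

# iter(callable, sentinel) keeps calling the lambda until it returns 'q'
for url in iter(lambda: input(msg), 'q'):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    h1 = soup.find("h1", class_="class-headline")
    print(h1.get_text() if h1 else f"no headline found on {url}")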



Answer 2:


If you want to scrape the URLs in batches, specify a batch size and iterate over the chunks:

from bs4 import BeautifulSoup
import requests

batch_size = 5
urllist = ["url1", "url2", "url3"]  # ... full URLs including the http(s):// scheme

# split the flat list into chunks of batch_size
url_chunks = [urllist[x:x + batch_size] for x in range(0, len(urllist), batch_size)]

def scrape_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    h1 = soup.find("h1", class_="class-headline")
    return h1.get_text()

def scrape_batch(url_chunk):
    chunk_resp = []
    for url in url_chunk:
        chunk_resp.append(scrape_url(url))
    return chunk_resp

for url_chunk in url_chunks:
    print(scrape_batch(url_chunk))
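Batching pays off mainly when the fetches inside a chunk run concurrently; as written above it is just a plain loop. A hedged sketch of the same chunked scrape using concurrent.futures from the standard library; the worker count and timeout are illustrative assumptions, not part of the answer:

from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
import requests

def scrape_url(url):
    # Same per-URL logic as above; the timeout is an assumed safeguard
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, "html.parser")
    h1 = soup.find("h1", class_="class-headline")
    return h1.get_text() if h1 else None

def scrape_batch(url_chunk, max_workers=5):
    # executor.map fetches the chunk concurrently and preserves input order
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(scrape_url, url_chunk))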


Source: https://stackoverflow.com/questions/40629457/scrape-multiple-urls-using-beautiful-soup
