Question
I am using https://www.realtor.com/realestateagents/phoenix_az//pg-2 as my starting point. I want to go from page 2 to page 5, and every page in between, collecting names and numbers as I go. I am collecting the information on page 2 perfectly, but I cannot get the scraper to advance to the next page without plugging in a new URL by hand. I tried to set up a loop to do this automatically, but with what I thought was a loop I am still only getting the information from page 2 (the starting point) before the scraper stops. I am new to loops and have tried multiple approaches, but none of them work.
Below is the complete code for now.
import requests
from requests import get
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import numpy as np
from numpy import arange
import pandas as pd
from time import sleep
from random import randint
headers = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)'
                          'AppleWebKit/537.36 (KHTML, like Gecko)'
                          'Chrome/45.0.2454.101 Safari/537.36'),
           'referer': 'https://www.realtor.com/realestateagents/phoenix_az//pg-2'}
my_url = 'https://www.realtor.com/realestateagents/phoenix_az//pg-2'
#opening up connection, grabbing the page
uClient = uReq(my_url)
#read page
page_html = uClient.read()
#close page
uClient.close()
pages = np.arange(2, 3, 1)
for page in pages:
    page = requests.get("https://www.realtor.com/realestateagents/phoenix_az//pg-", headers=headers)

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #finds all realtors on page
    containers = page_soup.findAll("div", {"class": "agent-list-card clearfix"})

    #creating csv file
    filename = "phoenix.csv"
    f = open(filename, "w")
    headers = "agent_name, agent_number\n"
    f.write(headers)

    #controlling scrape speed
    for container in containers:
        try:
            name = container.find('div', class_='agent-name text-bold')
            agent_name = name.a.text.strip()
        except AttributeError:
            print("-")
        try:
            number = container.find('div', class_='agent-phone hidden-xs hidden-xxs')
            agent_number = number.text.strip()
        except AttributeError:
            print("-")
        except NameError:
            print("-")
        try:
            print("name: " + agent_name)
            print("number: " + agent_number)
        except NameError:
            print("-")
        try:
            f.write(agent_name + "," + agent_number + "\n")
        except NameError:
            print("-")

f.close()
Answer 1:
Not sure if that's exactly what you need, but here's a working (and simplified) version based on your example that scrapes the first five pages. If you take a close look, I'm using a for loop to move through the pages by appending the page number to the URL. Then I fetch the HTML, parse it for the agent divs, and grab each name and number (if the number is None, I store "N/A" instead); finally, I dump the list to a .csv file.

EDIT: To match the comments, I've added a city column ("Phoenix") and a wait_for pause that stops the script for a random 1 to 10 seconds between pages (adjustable).
import csv
import random
import time

import requests
from bs4 import BeautifulSoup

realtor_data = []

for page in range(1, 6):
    print(f"Scraping page {page}...")
    url = f"https://www.realtor.com/realestateagents/phoenix_az/pg-{page}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for agent_card in soup.find_all("div", {"class": "agent-list-card clearfix"}):
        name = agent_card.find("div", {"class": "agent-name text-bold"}).find("a")
        number = agent_card.find("div", {"itemprop": "telephone"})
        realtor_data.append(
            [
                name.getText().strip(),
                number.getText().strip() if number is not None else "N/A",
                "Phoenix",
            ],
        )
    wait_for = random.randint(1, 10)
    print(f"Sleeping for {wait_for} seconds...")
    time.sleep(wait_for)

with open("data.csv", "w", newline="") as output:
    w = csv.writer(output)
    w.writerow(["NAME:", "PHONE NUMBER:", "CITY:"])
    w.writerows(realtor_data)
Output:

A .csv file with each realtor's name, phone number, and city.
NAME:                    PHONE NUMBER:   CITY:
------------------------ --------------- -------
Shawn Rogers             (480) 313-7031  Phoenix
The Jason Mitchell Group (480) 470-1993  Phoenix
Kyle Caldwell            (602) 390-2245  Phoenix
THE VALENTINE GROUP      N/A             Phoenix
Nancy Wolfe              (602) 418-1010  Phoenix
Rhonda DuBois            (623) 418-2970  Phoenix
Sabrina Hurley           (602) 410-1985  Phoenix
Bryan Adams              (480) 375-1292  Phoenix
DeAnn Fry                (623) 748-3818  Phoenix
Esther P Goh             (480) 703-3836  Phoenix
...
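
One more note: realtor.com may block requests that don't look like they come from a browser. If you start seeing 403 responses, a possible tweak (just a sketch; the User-Agent string below is an arbitrary example, borrowed from the question) is to route the requests through a requests.Session with browser-like headers:

session = requests.Session()
# any reasonably current browser User-Agent string will do
session.headers.update({
    "user-agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/45.0.2454.101 Safari/537.36"),
})
# drop-in replacement for the requests.get(url) call inside the loop
soup = BeautifulSoup(session.get(url).text, "html.parser")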
Answer 2:
You have to move between pages by adding the page number to the URL:

page_html = requests.get("https://www.realtor.com/realestateagents/phoenix_az//pg-" + str(page), headers=headers)
#html parsing
page_soup = soup(page_html.text, "html.parser")

Also, you have a mistake with the variable name: the response should be assigned to page_html. Your loop stores it in page and then keeps parsing the old page_html that was fetched before the loop, which is why you only ever see page 2.
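
Putting both fixes together, a minimal sketch of how the corrected loop could look (untested; it reuses the imports, headers dict, and CSS classes from the question, and opens the CSV once before the loop so later pages don't overwrite earlier ones):

with open("phoenix.csv", "w") as f:
    f.write("agent_name,agent_number\n")
    for page in range(2, 6):  # pages 2 through 5
        # fetch the current page and assign the response to page_html
        page_html = requests.get(
            "https://www.realtor.com/realestateagents/phoenix_az/pg-" + str(page),
            headers=headers,
        )
        page_soup = soup(page_html.text, "html.parser")
        for container in page_soup.findAll("div", {"class": "agent-list-card clearfix"}):
            name = container.find("div", class_="agent-name text-bold")
            if name is None or name.a is None:
                continue  # skip cards without a linked agent name
            number = container.find("div", class_="agent-phone hidden-xs hidden-xxs")
            agent_number = number.text.strip() if number is not None else "N/A"
            f.write(name.a.text.strip() + "," + agent_number + "\n")
        sleep(randint(1, 10))  # pause between pages to be polite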
Source: https://stackoverflow.com/questions/64144929/webscraper-wont-loop-from-page-2-to-page-5