Webscraper won't loop from page 2 to page 5

六月ゝ 毕业季﹏ submitted on 2021-02-10 05:13:23

Question


I am using https://www.realtor.com/realestateagents/phoenix_az//pg-2 as my starting point. I want to go from page 2 to page 5, including every page in between, collecting names and numbers along the way. I collect the information on page 2 perfectly, but I cannot get the scraper to move to the next page without plugging in a new URL by hand. I tried to set up a loop to do this automatically, but after coding what I thought would be a loop I still get only the information from page 2 (the starting point) before the scraper stops. I am new to loops and have tried multiple approaches, but none of them work.

Below is the complete code for now.

import requests
from requests import get
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup 
import numpy as np
from numpy import arange
import pandas as pd 

from time import sleep
from random import randint

headers = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)'
                          'AppleWebKit/537.36 (KHTML, like Gecko)'
                          'Chrome/45.0.2454.101 Safari/537.36'),
                          'referer': 'https://www.realtor.com/realestateagents/phoenix_az//pg-2'}

my_url = 'https://www.realtor.com/realestateagents/phoenix_az//pg-2'

#opening up connection, grabbing the page
uClient = uReq(my_url)
#read page 
page_html = uClient.read()
#close page
uClient.close()

pages = np.arange(2, 3, 1)

for page in pages:

    page = requests.get("https://www.realtor.com/realestateagents/phoenix_az//pg-" , headers=headers)

#html parsing
page_soup = soup(page_html, "html.parser")

#finds all realtors on page 
containers = page_soup.findAll("div",{"class":"agent-list-card clearfix"})

#creating csv file 
filename = "phoenix.csv"
f = open(filename, "w")

headers = "agent_name, agent_number\n"
f.write(headers)

#controlling scrape speed 


for container in containers:

    try:
        name = container.find('div', class_='agent-name text-bold')
        agent_name = name.a.text.strip()
    except AttributeError:
        print("-")

    try:
        number = container.find('div', class_='agent-phone hidden-xs hidden-xxs')
        agent_number = number.text.strip()
    except AttributeError:
        print("-")
    except NameError:
        print("-")

    try:
        print("name: " + agent_name)
        print("number: " + agent_number)
    except NameError:
        print("-")

    try:
        f.write(agent_name + "," + agent_number + "\n")
    except NameError:
        print("-")

f.close()

Answer 1:


Not sure if that's exactly what you need, but here's a working (and simplified) version of your code that scrapes the first five pages.

If you take a close look, I'm using a for loop to "move" through the pages by appending the page number to the URL. Then I get the HTML, parse it for the agent cards, grab each name and number (adding N/A if the number is None), and finally dump the list to a CSV file.

EDIT: To match the comments, I've added a city column (Phoenix) and a wait_for feature that pauses the script for a random 1 to 10 seconds between pages (adjustable).

import csv
import random
import time

import requests
from bs4 import BeautifulSoup


realtor_data = []

for page in range(1, 6):
    print(f"Scraping page {page}...")
    url = f"https://www.realtor.com/realestateagents/phoenix_az/pg-{page}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    for agent_card in soup.find_all("div", {"class": "agent-list-card clearfix"}):
        name = agent_card.find("div", {"class": "agent-name text-bold"}).find("a")
        number = agent_card.find("div", {"itemprop": "telephone"})
        realtor_data.append(
            [
                name.getText().strip(),
                number.getText().strip() if number is not None else "N/A",
                "Pheonix",
             ],
        )
    wait_for = random.randint(1, 10)
    print(f"Sleeping for {wait_for} seconds...")
    time.sleep(wait_for)

with open("data.csv", "w") as output:
    w = csv.writer(output)
    w.writerow(["NAME:", "PHONE NUMBER:"])
    w.writerows(realtor_data)

Output:

A .csv file with each realtor's name, phone number, and city.

NAME:                     PHONE NUMBER:    CITY:
------------------------  ---------------  -------
Shawn Rogers              (480) 313-7031   Phoenix
The Jason Mitchell Group  (480) 470-1993   Phoenix
Kyle Caldwell             (602) 390-2245   Phoenix
THE VALENTINE GROUP       N/A              Phoenix
Nancy Wolfe               (602) 418-1010   Phoenix
Rhonda DuBois             (623) 418-2970   Phoenix
Sabrina Hurley            (602) 410-1985   Phoenix
Bryan Adams               (480) 375-1292   Phoenix
DeAnn Fry                 (623) 748-3818   Phoenix
Esther P Goh              (480) 703-3836   Phoenix
...
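
Note: if realtor.com ever starts blocking the plain requests.get calls above, you could reuse the browser-like headers from your question via a requests.Session, so every request in the loop carries them. This is just a sketch, assuming the same header values as in your snippet:

import requests

# Sketch: a Session that sends browser-like headers on every request.
# Header values are copied from the question; adjust as needed.
session = requests.Session()
session.headers.update({
    "user-agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/45.0.2454.101 Safari/537.36"),
    "referer": "https://www.realtor.com/realestateagents/phoenix_az/pg-1",
})

# Drop-in replacement for requests.get(url) inside the loop above:
# soup = BeautifulSoup(session.get(url).text, "html.parser")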



Answer 2:


You have to move between the pages by building the URL (with the page number) and fetching it inside the loop:

page_html = requests.get("https://www.realtor.com/realestateagents/phoenix_az//pg-" + str(page), headers=headers)
#html parsing
page_soup = soup(page_html.text, "html.parser")

Also, you have a mistake with the variable name inside the loop: the response should be assigned to page_html (the variable you actually parse), not to page.
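
Putting both fixes together, here is a minimal sketch of the corrected loop (URL, headers, and selector taken from your question; the print at the end is just to check that every page actually gets fetched):

import requests
from bs4 import BeautifulSoup as soup

headers = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/45.0.2454.101 Safari/537.36')}

for page in range(2, 6):  # pages 2 through 5
    # fetch and parse INSIDE the loop, assigning to page_html
    page_html = requests.get(
        "https://www.realtor.com/realestateagents/phoenix_az//pg-" + str(page),
        headers=headers)
    page_soup = soup(page_html.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "agent-list-card clearfix"})
    print(f"page {page}: found {len(containers)} agent cards")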



Source: https://stackoverflow.com/questions/64144929/webscraper-wont-loop-from-page-2-to-page-5
