Scraping multiple paginated links with BeautifulSoup and Requests


Question


Python beginner here. I'm trying to scrape all the products in one category on dabs.com. I've managed to scrape all the products on a given page, but I'm having trouble iterating over the paginated links.

Right now I've tried to isolate all the pagination buttons with the span class="page-list", but even that isn't working. Ideally, I'd like to make the crawler keep clicking "next" until it has scraped all products on all pages. How can I do this?

I'd really appreciate any input.

from bs4 import BeautifulSoup
import requests

base_url = "http://www.dabs.com"
page_array = []

def get_pages():
    html = requests.get(base_url)
    soup = BeautifulSoup(html.content, "html.parser")

    # `class` is a reserved word in Python, so BeautifulSoup takes `class_`
    page_list = soup.find_all('span', class_="page-list")
    pages = page_list[0].find_all('a')

    for page in pages:
        page_array.append(page.get('href'))

def scrape_page(page):
    # fetch the page that was passed in, not base_url
    html = requests.get(page)
    soup = BeautifulSoup(html.content, "html.parser")
    product_table = soup.find_all("table")
    products = product_table[0].find_all("tr")

    # skip the header row if the table has one
    if len(products) > 0:
        products = products[1:]

    for row in products:
        cells = row.find_all('td')
        data = {
            'description': cells[0].get_text(),
            'price': cells[1].get_text()
        }
        print(data)  # Python 3 print function

get_pages()
for page in page_array:
    scrape_page(base_url + page)

Answer 1:


Their next-page button has a title of "Next", so you could do something like:

import requests
from bs4 import BeautifulSoup as bs

# the URL needs a scheme, otherwise requests raises MissingSchema
url = 'http://www.dabs.com/category/computing/11001/'
base_url = 'http://www.dabs.com'

r = requests.get(url)

# pass a parser explicitly to avoid bs4's "no parser specified" warning
soup = bs(r.text, 'html.parser')
elm = soup.find('a', {'title': 'Next'})

next_page_link = base_url + elm['href']
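
If you want the crawler to keep going until every page has been scraped, you can wrap that in a loop that follows the "Next" link until it disappears. Here's a minimal sketch along those lines, reusing the row-parsing from your question and assuming the products sit in the first table on each page:

import requests
from bs4 import BeautifulSoup as bs

base_url = 'http://www.dabs.com'
url = base_url + '/category/computing/11001/'

while url:
    r = requests.get(url)
    soup = bs(r.text, 'html.parser')

    # scrape the products on the current page
    # (assumes the first table holds one product per <tr>, as in the question)
    rows = soup.find_all('table')[0].find_all('tr')[1:]
    for row in rows:
        cells = row.find_all('td')
        print({'description': cells[0].get_text(),
               'price': cells[1].get_text()})

    # follow the "Next" link if there is one, otherwise stop
    elm = soup.find('a', {'title': 'Next'})
    url = base_url + elm['href'] if elm else None

When soup.find returns None there is no "Next" anchor left on the page, so the loop ends naturally on the last page.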

Hope that helps.



Source: https://stackoverflow.com/questions/28597041/scraping-multiple-paginated-links-with-beautifulsoup-and-requests
