Python web scraping - Loop through all categories and subcategories

Question

I am trying to retrieve all categories and subcategories of a retail website. Once I am inside a category, I can use BeautifulSoup to pull every product in it. However, I am struggling with the loop over the categories. I'm using https://www.uniqlo.com/us/en/women as a test website.

How do I loop through each category, as well as the subcategories, listed on the left side of the website? The problem is that you have to click on a category before the website displays its subcategories. I would like to extract all products within each category/subcategory into a CSV file. This is what I have so far:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

myurl = 'https://www.uniqlo.com/us/en/women/'
with urlopen(myurl) as client:
    page_html = client.read()
page_soup = BeautifulSoup(page_html, "html.parser")
product_list = []

# Find every <li> whose class starts with "grid-tile"
containers = page_soup.find_all(
    "li", {"class": lambda c: c and c.startswith('grid-tile')})

for container in containers:
    product_container = container.find_all("div", {"class": "product-swatches"})
    if not product_container:
        continue  # this tile has no swatch list

    for swatch in product_container[0].find_all("li"):
        try:
            product_name = swatch.a.img.get("alt")
        except AttributeError:
            continue  # swatch without a linked image
        if not product_name:
            continue  # image without alt text

        # Keep only the part of the alt text before the first comma
        product_mod_name = product_name.split(',')[0].lstrip()
        print(product_mod_name)
        product_list.append([product_mod_name])

# Write one product name per row
with open('products.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(product_list)

Answer 1:

You can try this script. It walks through the categories and subcategories and parses the title and price of each product. Several products share the same name and differ only in color, so don't treat them as duplicates. I've written the script in a very compact manner, so expand it as you see fit:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.uniqlo.com/us/en/women')
soup = BeautifulSoup(res.text, "lxml")

for items in soup.select("#category-level-1 .refinement-link"):
    page = requests.get(items['href'])
    broth = BeautifulSoup(page.text,"lxml")

    for links in broth.select("#category-level-2 .refinement-link"):
        req = requests.get(links['href'])
        sauce = BeautifulSoup(req.text,"lxml")

        for data in sauce.select(".product-tile-info"):
            title = data.select(".name-link")[0].text
            price = ' '.join([item.text for item in data.select(".product-pricing span")])
            print(title.strip(),price.strip())

The results look like this:

WOMEN CASHMERE CREW NECK SWEATER $79.90
Women Extra Fine Merino Crew Neck Sweater $29.90 $19.90
WOMEN KAWS X PEANUTS LONG-SLEEVE HOODED SWEATSHIRT $19.90
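
The question asked for a CSV file, while the script above only prints. A minimal sketch of the missing step follows, assuming the title and price are collected as pairs in a list inside the innermost loop instead of being printed. The function name write_products, the header names, and the sample_rows list are hypothetical and not part of the original answer; the sample values are copied from the results shown above.

```python
import csv


def write_products(rows, path="products.csv"):
    """Write (title, price) pairs to a CSV file, preceded by a header row.

    Returns the number of product rows written.
    """
    # newline="" stops the csv module from emitting blank lines on Windows
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "price"])  # hypothetical column names
        writer.writerows(rows)
    return len(rows)


# Sample rows taken from the printed results; in the scraper these would be
# built with rows.append((title.strip(), price.strip())) in place of print().
sample_rows = [
    ("WOMEN CASHMERE CREW NECK SWEATER", "$79.90"),
    ("Women Extra Fine Merino Crew Neck Sweater", "$29.90 $19.90"),
    ("WOMEN KAWS X PEANUTS LONG-SLEEVE HOODED SWEATSHIRT", "$19.90"),
]
```

Calling write_products(sample_rows) would create products.csv in the current directory. Collecting all rows first and writing once at the end keeps the file handling out of the nested loops.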

Source: https://stackoverflow.com/questions/47567368/python-web-scraping-loop-through-all-categories-and-subcategories

Tags

python

beautifulsoup