Question
I am trying to retrieve all categories and subcategories of a retail website. Once I am inside a category, I can use BeautifulSoup to pull every product in it. However, I am struggling with the loop over the categories. I'm using https://www.uniqlo.com/us/en/women as a test website.
How do I loop through each category and subcategory listed on the left side of the page? The problem is that you have to click on a category before the site displays its subcategories. I would like to extract all products within each category/subcategory into a CSV file. This is what I have so far:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

myurl = 'https://www.uniqlo.com/us/en/women/'
with urlopen(myurl) as client:
    page_html = client.read()
page_soup = BeautifulSoup(page_html, "html.parser")

product_list = []
# Find every <li> whose class starts with "grid-tile"
containers = page_soup.find_all("li", {"class": lambda c: c and c.startswith('grid-tile')})
for container in containers:
    swatches = container.find_all("div", {"class": "product-swatches"})
    if not swatches:
        continue
    for item in swatches[0].find_all("li"):
        try:
            product_name = item.a.img.get("alt")
            product_mod_name = product_name.split(',')[0].lstrip()
        except AttributeError:
            # Skip swatches without a linked image or alt text
            continue
        print(product_mod_name)
        product_list.append([product_mod_name])

# Write one product name per row
with open('products.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(product_list)
Answer 1:
You can try this script. It walks through the categories and subcategories and parses the title and price of each product. Several products share the same name and differ only in color, so don't treat them as duplicates. I've written the script compactly, so expand it as you see fit:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.uniqlo.com/us/en/women')
soup = BeautifulSoup(res.text, "lxml")
# Top-level categories in the left-hand navigation
for items in soup.select("#category-level-1 .refinement-link"):
    page = requests.get(items['href'])
    broth = BeautifulSoup(page.text, "lxml")
    # Subcategories are only listed on the category's own page
    for links in broth.select("#category-level-2 .refinement-link"):
        req = requests.get(links['href'])
        sauce = BeautifulSoup(req.text, "lxml")
        # One tile per product: extract its title and price(s)
        for data in sauce.select(".product-tile-info"):
            title = data.select(".name-link")[0].text
            price = ' '.join([item.text for item in data.select(".product-pricing span")])
            print(title.strip(), price.strip())
The output looks like this:
WOMEN CASHMERE CREW NECK SWEATER $79.90
Women Extra Fine Merino Crew Neck Sweater $29.90 $19.90
WOMEN KAWS X PEANUTS LONG-SLEEVE HOODED SWEATSHIRT $19.90
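The question asks for a CSV file, while the script above only prints. A minimal sketch of the missing step follows, using only the standard library. The helper names `to_row` and `write_products` and the column names are hypothetical, and reading two prices as "list price, then sale price" is an assumption based on the sample output, not something the site confirms.

```python
import csv
import re

# Matches dollar amounts such as "$79.90" in the scraped price text
PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")


def to_row(title, price_text):
    """Turn one scraped (title, price text) pair into a CSV row.

    Assumption: when two prices appear, the first is the list price
    and the second is the sale price.
    """
    prices = PRICE_RE.findall(price_text)
    list_price = prices[0] if prices else ""
    sale_price = prices[1] if len(prices) > 1 else ""
    return [title.strip(), list_price, sale_price]


def write_products(pairs, path="products.csv"):
    """Write (title, price text) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "list_price", "sale_price"])
        writer.writerows(to_row(t, p) for t, p in pairs)
```

To use it with the script above, one would append `(title, price)` to a list in place of the `print` call and pass that list to `write_products` after the loops finish. Color variants that share a name are written as separate rows, in line with the note about not treating them as duplicates.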
Source: https://stackoverflow.com/questions/47567368/python-web-scraping-loop-through-all-categories-and-subcategories