Iterating html through tag classes with BeautifulSoup

Question

I'm saving some specific tags from a webpage to an Excel file, so I have this code:

`import requests
from bs4 import BeautifulSoup
import openpyxl

url = "http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")

wb = openpyxl.Workbook()
ws = wb.active

tagiterator = soup.h2

row, col = 1, 1
ws.cell(row=row, column=col, value=tagiterator.getText())
tagiterator = tagiterator.find_next()

while tagiterator.find_next():
    if tagiterator.name == 'h2':
        row += 1
        col = 1
        ws.cell(row=row, column=col, value=tagiterator.getText(strip=True))
    elif tagiterator.name == 'span':
        col += 1
        ws.cell(row=row, column=col, value=tagiterator.getText(strip=True))
    tagiterator = tagiterator.find_next()

wb.save('DG3test.xlsx')`

It works, but I want to exclude some tags. I want only the h2 tags that have the 'product-name' class and the span tags that have the 'attribute-value' class. I tried to do this with:

tagiterator['class'] == 'product-name'

tagiterator.hasClass('product-name')

tagiterator.get

And a few more, none of which worked.

The values I want are shown in this rough image I created: https://ibb.co/eWLsoQ and the URL is in the code.

Answer 1:

I have left out writing to the Excel file, since that is something you have already done; if you need it, just write a comment and I'll include the code for it. The same logic applies: write each product's information into one row, then do row += 1 and reset the column to 1, so that every product stays within its own row.

from bs4 import BeautifulSoup

import requests

header = {'User-agent' : 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}


url = requests.get("http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml", headers=header).text
soup = BeautifulSoup(url, 'lxml')

find_products = soup.findAll('div',{'class':'product-row'})

for item in find_products:
    title_text = item.find('div',{'class':'product-header'}).h2.a.text.strip()  # title / name of the product
    # print(title_text)
    display = item.find('span',{'class':'attribute-value'}).text.strip()  # first attribute, e.g. "49 cali, Full HD, 1920 x 1080"
    # print(display)
    functions_item = item.findAll('span',{'class':'attribute-value'})[1]  # second attribute: the functions ('Funkcje')
    list_of_funcs = functions_item.findAll('a')  # one <a> tag per function, e.g. wifi
    # Now you can store them or process them further...

    for funcs in list_of_funcs:
        print(funcs.text.strip())
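
On why the attempts in the question fail: BeautifulSoup treats `class` as a multi-valued attribute, so `tag['class']` is a list such as `['product-name']`, and comparing it to a string is always False. Below is a minimal sketch of the question's own `find_next()` loop with a membership test instead. The HTML snippet and the `has_class` helper are hypothetical and only imitate the class names mentioned in the question; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates the class names from the question.
html = """
<div class="product-row">
  <h2 class="product-name"><a href="#">TV A</a></h2>
  <h2 class="other">Ignore me</h2>
  <span class="attribute-value">49 cali</span>
  <span class="price">999</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def has_class(tag, name):
    """Hypothetical helper: True if the tag carries the given CSS class."""
    # tag.get("class") is a list of class names, or None if there is no class.
    return name in (tag.get("class") or [])


# The attribute value is a list, so comparing it to a string never matches.
print(soup.h2["class"])
print(soup.h2["class"] == "product-name")

wanted = []
tag = soup.find(True)  # first tag in the document
while tag is not None:
    if tag.name == "h2" and has_class(tag, "product-name"):
        wanted.append(tag.get_text(strip=True))
    elif tag.name == "span" and has_class(tag, "attribute-value"):
        wanted.append(tag.get_text(strip=True))
    tag = tag.find_next()  # returns None after the last tag

print(wanted)
```

Looping until `find_next()` returns None also visits the last tag, which a `while tagiterator.find_next():` condition would skip.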

Algorithm:

We find each product
We find tags within each product and extract the relevant information
We use the .text to extract only the text portion
We use for loops to iterate over the products and, within each product, over its functions (the tags that list the product's capabilities).
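
The steps above, combined with the Excel output the question asks for, could look roughly like the sketch below. It is not tested against the live site: `SAMPLE_HTML` and the `build_workbook` function are made up for illustration, and the assumption that each product sits in a `product-row` div with a `product-name` h2 and `attribute-value` spans comes from the question and the code above, not from inspecting the page.

```python
from bs4 import BeautifulSoup
import openpyxl

# Hypothetical sample markup; the real page may be structured differently.
SAMPLE_HTML = """
<div class="product-row">
  <h2 class="product-name"><a href="#">TV A</a></h2>
  <span class="attribute-value">49 cali</span>
  <span class="attribute-value">Wi-Fi</span>
</div>
<div class="product-row">
  <h2 class="product-name"><a href="#">TV B</a></h2>
  <span class="attribute-value">55 cali</span>
</div>
"""


def build_workbook(html):
    """Write one product per row: name in column 1, attributes after it."""
    soup = BeautifulSoup(html, "html.parser")
    wb = openpyxl.Workbook()
    ws = wb.active
    row = 0
    for product in soup.find_all("div", class_="product-row"):
        name = product.find("h2", class_="product-name")
        if name is None:
            continue  # skip rows without a product name
        row += 1
        values = product.find_all("span", class_="attribute-value")
        cells = [name.get_text(strip=True)]
        cells += [v.get_text(strip=True) for v in values]
        # The column restarts at 1 for every product, so each stays on its own row.
        for col, text in enumerate(cells, start=1):
            ws.cell(row=row, column=col, value=text)
    return wb


wb = build_workbook(SAMPLE_HTML)
for line in wb.active.iter_rows(values_only=True):
    print(line)
```

To keep the result, call `wb.save('DG3test.xlsx')` as in the question. Passing `class_=` to `find` and `find_all` matches tags that have the class among their classes, which avoids comparing `tag['class']` by hand.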

Source: https://stackoverflow.com/questions/44608008/iterating-html-through-tag-classes-with-beautifulsoup

Tags

html

python-2.7

beautifulsoup