Question
I'm saving some specific tags from a webpage to an Excel file, so I have this code:
`import requests
from bs4 import BeautifulSoup
import openpyxl

url = "http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
wb = openpyxl.Workbook()
ws = wb.active
tagiterator = soup.h2
row, col = 1, 1
ws.cell(row=row, column=col, value=tagiterator.getText())
tagiterator = tagiterator.find_next()
while tagiterator.find_next():
    if tagiterator.name == 'h2':
        row += 1
        col = 1
        ws.cell(row=row, column=col, value=tagiterator.getText(strip=True))
    elif tagiterator.name == 'span':
        col += 1
        ws.cell(row=row, column=col, value=tagiterator.getText(strip=True))
    tagiterator = tagiterator.find_next()
wb.save('DG3test.xlsx')`
It works, but I want to exclude some tags. I want only the h2 tags that have the 'product-name' class and the span tags that have the 'attribute-value' class. I tried to do this with:
tagiterator['class'] == 'product-name'
tagiterator.hasClass('product-name')
tagiterator.get
and a few other things, none of which worked.
The values I want are marked in this rough image I made: https://ibb.co/eWLsoQ and the URL is in the code.
Answer 1:
I did not include writing to the Excel file, since that is something you have already done; if you need it, leave a comment and I'll add the code. The same logic applies: write the product information, then for each new product do row += 1 and reset the column, so that each product stays within its own row.
from bs4 import BeautifulSoup
import requests

header = {'User-agent' : 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}
url = requests.get("http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml", headers=header).text
soup = BeautifulSoup(url, 'lxml')
find_products = soup.findAll('div',{'class':'product-row'})
for item in find_products:
    # Find the title / name of the product
    title_text = item.find('div',{'class':'product-header'}).h2.a.text.strip()
    # print(title_text)
    # Find the first attribute value, e.g. "49 cali, Full HD, 1920 x 1080"
    display = item.find('span',{'class':'attribute-value'}).text.strip()
    # print(display)
    # The second attribute value holds the functions ('Funkcje')
    functions_item = item.findAll('span',{'class':'attribute-value'})[1]
    # Each function (e.g. wifi) is a link inside that span
    list_of_funcs = functions_item.findAll('a')
    # Now you can store them or do something else with them
    for funcs in list_of_funcs:
        print(funcs.text.strip())
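As a side note on the attempts listed in the question, here is a small sketch of why they do not match. BeautifulSoup treats `class` as a multi-valued attribute, so `tag['class']` is a list rather than a string, and `hasClass` is a jQuery method that BeautifulSoup tags do not provide. The HTML snippet below is made up for illustration and is not taken from the euro.com.pl page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only loosely modelled on the page in the question.
html = """
<h2 class="product-name"><a href="#">TV A</a></h2>
<h2 class="other">Not a product</h2>
<span class="attribute-value">49 cali</span>
<span>ignored</span>
"""
soup = BeautifulSoup(html, "html.parser")

h2 = soup.h2
class_value = h2["class"]                          # a list of class names
equals_string = h2["class"] == "product-name"      # compares a list with a str
has_class = "product-name" in h2.get("class", [])  # membership test on the list

# find_all with class_ filters by class directly.
names = [t.get_text(strip=True) for t in soup.find_all("h2", class_="product-name")]
values = [t.get_text(strip=True) for t in soup.find_all("span", class_="attribute-value")]
```

Using `.get("class", [])` rather than `["class"]` avoids a `KeyError` on tags that have no class attribute at all.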
Algorithm:
- We find each product.
- We find the tags within each product and extract the relevant information.
- We use .text to extract only the text portion.
- We use for loops to iterate through each product and then through its functions, i.e. the tags that list the product's capabilities.
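The steps above, together with the Excel output that the answer leaves out, could be sketched roughly as follows. This assumes each product sits in a `div` with class `product-row` that contains an `h2` with class `product-name` and `span` tags with class `attribute-value`, as the question and answer suggest. The sample HTML, the `extract_rows` helper, and the output filename are made up for illustration; the real page structure may differ.

```python
from bs4 import BeautifulSoup
import openpyxl

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<div class="product-row">
  <div class="product-header"><h2 class="product-name"><a href="#">TV A</a></h2></div>
  <span class="attribute-value">49 cali, Full HD</span>
  <span class="attribute-value">Wi-Fi, Smart TV</span>
  <span class="price">ignored</span>
</div>
<div class="product-row">
  <div class="product-header"><h2 class="product-name"><a href="#">TV B</a></h2></div>
  <span class="attribute-value">55 cali, 4K</span>
</div>
"""

def extract_rows(html):
    """Return one list per product: [name, attribute value 1, attribute value 2, ...]."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for product in soup.find_all("div", class_="product-row"):
        name = product.find("h2", class_="product-name")
        if name is None:
            continue  # skip rows that have no product name
        values = product.find_all("span", class_="attribute-value")
        rows.append([name.get_text(strip=True)] +
                    [v.get_text(strip=True) for v in values])
    return rows

rows = extract_rows(SAMPLE_HTML)

wb = openpyxl.Workbook()
ws = wb.active
for row in rows:
    ws.append(row)  # each product goes on its own row, one value per column
wb.save("products_sketch.xlsx")  # hypothetical filename
```

Collecting the rows first and calling `ws.append` once per product replaces the manual row and column counters from the question; for the real page, `SAMPLE_HTML` would be swapped for the text returned by `requests.get(...)`.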
Source: https://stackoverflow.com/questions/44608008/iterating-html-through-tag-classes-with-beautifulsoup