Question
I'm parsing data about car production with BeautifulSoup (see also my first question):
from bs4 import BeautifulSoup
import string
html = """
<h4>Production Capacity (year)</h4>
<div class="profile-area">
Vehicle 1,140,000 units /year
</div>
<h4>Output</h4>
<div class="profile-area">
Vehicle 809,000 units ( 2016 )
</div>
<div class="profile-area">
Vehicle 815,000 units ( 2015 )
</div>
<div class="profile-area">
Vehicle 836,000 units ( 2014 )
</div>
<div class="profile-area">
Vehicle 807,000 units ( 2013 )
</div>
<div class="profile-area">
Vehicle 760,000 units ( 2012 )
</div>
<div class="profile-area">
Vehicle 805,000 units ( 2011 )
</div>
"""
soup = BeautifulSoup(html, 'lxml')
for item in soup.select("div.profile-area"):
    produkz = item.text.strip()
    produkz = produkz.replace('\n',':')
    prev_h4 = str(item.find_previous_sibling('h4'))
    if "Models" in prev_h4:
        models=produkz
    else:
        models=""
    if "Capacity" in prev_h4:
        capacity=produkz
    else:
        capacity=""
    if "( 2015 )" in produkz:
        prod15=produkz
    else:
        prod15=""
    if "( 2016 )" in produkz:
        prod16=produkz
    else:
        prod16=""
    if "( 2017 )" in produkz:
        prod17=produkz
    else:
        prod17=""
    print(models+';'+capacity+';'+prod15+';'+prod16+';'+prod17)
My problem is that each pass of the loop over the matching HTML elements ("div.profile-area") overwrites my result, so every element ends up on its own line:
;Vehicle 1,140,000 units /year;;;;;;
;;;;;;Vehicle 809,000 units ( 2016 );
;;;;;Vehicle 815,000 units ( 2015 );;
;;;;Vehicle 836,000 units ( 2014 );;;
;;;Vehicle 807,000 units ( 2013 );;;;
;;Vehicle 760,000 units ( 2012 );;;;;
;;;;;;;
My desired result is:
;Vehicle 1,140,000 units /year;Vehicle 760,000 units ( 2012 );Vehicle 807,000 units ( 2013 );Vehicle 836,000 units ( 2014 );Vehicle 815,000 units ( 2015 );Vehicle 809,000 units ( 2016 );
I would be glad if you could show me a better way to structure my code. Thanks in advance.
Answer 1:
I would suggest storing each entry in a dictionary; you can then easily extract the fields you want at the end (you don't seem to want 2011?):
from bs4 import BeautifulSoup
import re
html = """
<h4>Production Capacity (year)</h4>
<div class="profile-area">
Vehicle 1,140,000 units /year
</div>
<h4>Output</h4>
<div class="profile-area">
Vehicle 809,000 units ( 2016 )
</div>
<div class="profile-area">
Vehicle 815,000 units ( 2015 )
</div>
<div class="profile-area">
Vehicle 836,000 units ( 2014 )
</div>
<div class="profile-area">
Vehicle 807,000 units ( 2013 )
</div>
<div class="profile-area">
Vehicle 760,000 units ( 2012 )
</div>
<div class="profile-area">
Vehicle 805,000 units ( 2011 )
</div>
"""
soup = BeautifulSoup(html, 'lxml')
units = {}
for item in soup.find_all(['h4', 'div']):
    if item.name == 'h4':
        for h4 in ['capacity', 'output', 'models']:
            if h4 in item.text.lower():
                break
    elif item.get('class', [''])[0] == 'profile-area':
        vehicle = item.get_text(strip=True)
        if h4 == 'output':
            re_year = re.search(r'\( (\d+) \)', vehicle)
            if re_year:
                year = re_year.group(1)
            else:
                year = 'unknown'
            units[year] = vehicle
        else:
            units[h4] = vehicle
req_fields = ['models', 'capacity', '2012', '2013', '2014', '2015', '2016']
print(';'.join([units.get(field, '') for field in req_fields]))
This would display:
;Vehicle 1,140,000 units /year;Vehicle 760,000 units ( 2012 );Vehicle 807,000 units ( 2013 );Vehicle 836,000 units ( 2014 );Vehicle 815,000 units ( 2015 );Vehicle 809,000 units ( 2016 )
A regular expression is used to extract the year from the vehicle entry. This is then used as the key in the dictionary.
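To make the keying step concrete, here is a minimal sketch that applies the same pattern to a few entries copied from the question, without any HTML parsing. The names `entries` and `YEAR_RE` are only illustrative and are not part of the answer's code.

```python
import re

# Sample entries copied from the HTML in the question.
entries = [
    "Vehicle 809,000 units ( 2016 )",
    "Vehicle 815,000 units ( 2015 )",
    "Vehicle 1,140,000 units /year",  # no year in parentheses
]

# Same pattern as in the answer: digits inside "( ... )".
YEAR_RE = re.compile(r'\( (\d+) \)')

units = {}
for entry in entries:
    match = YEAR_RE.search(entry)
    # Fall back to 'unknown' when no "( YYYY )" is present, as the answer does.
    key = match.group(1) if match else 'unknown'
    units[key] = entry

# A year that was never seen (here 2014) becomes an empty field
# instead of raising KeyError, because dict.get supplies a default.
line = ';'.join(units.get(field, '') for field in ['2014', '2015', '2016'])
print(line)
```

Because every entry is stored under its own key, later entries no longer wipe out earlier ones, and the order of the output is decided only by the list of requested fields.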
For the HTML in pastebin it gives:
Volkswagen Golf, Golf Variant(Estate), Golf Plus, CrossGolf (2006-), e-Golf (2014-)Volkswagen Touran, CrossTouran (2007-), Tiguan (2007-);I.D. electric vehicles based on MEB (planning);SEAT new SUV MQB-A2 platform (2018- planning);Components:press shop, chassis, plastics technology;Vehicle 1,140,000 units /year;Vehicle 760,000 units ( 2012 );Vehicle 807,000 units ( 2013 );Vehicle 836,000 units ( 2014 );Vehicle 815,000 units ( 2015 );Vehicle 809,000 units ( 2016 )
Answer 2:
This is my solution. You need to handle each element tag and parse it accordingly. I went a step beyond your question and offered a more flexible way to access each data value. Hope it helps.
import re
from bs4 import BeautifulSoup
html_doc = """
<h4>Production Capacity (year)</h4>
<div class="profile-area">
Vehicle 1,140,000 units /year
</div>
<h4>Output</h4>
<div class="profile-area">
Vehicle 809,000 units ( 2016 )
</div>
<div class="profile-area">
Vehicle 815,000 units ( 2015 )
</div>
<div class="profile-area">
Vehicle 836,000 units ( 2014 )
</div>
<div class="profile-area">
Vehicle 807,000 units ( 2013 )
</div>
<div class="profile-area">
Vehicle 760,000 units ( 2012 )
</div>
<div class="profile-area">
Vehicle 805,000 units ( 2011 )
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
h4_elements = soup.find_all('h4')
profile_areas = soup.find_all('div', attrs={'class': 'profile-area'})
print('\n')
print("++++++++++++++++++++++++++++++++++++")
print("Element counts")
print("++++++++++++++++++++++++++++++++++++")
print("Total H4: {}".format(len(h4_elements)))
print("++++++++++++++++++++++++++++++++++++")
print("Total profile-area: {}".format(len(profile_areas)))
print("++++++++++++++++++++++++++++++++++++")
print('\n')
for i in h4_elements:
    print("++++++++++++++++++++++++++++++++++++")
    print(i.text.rstrip().lstrip())
    print("++++++++++++++++++++++++++++++++++++")
    del profile_areas[0]
    for j in profile_areas:
        raw = re.sub('[^A-Za-z0-9]+', ' ', j.text.replace(',','').lstrip().rstrip())
        raw = raw.rstrip()
        el = raw.split(' ')
        print('Type: {} '.format(el[0]))
        print('Sold: {} {} '.format(el[1], el[2]))
        print('Year: {} '.format(el[3]))
        print("++++++++++++++++++++++++++++++++++++")
The output is the following:
++++++++++++++++++++++++++++++++++++
Production Capacity (year)
++++++++++++++++++++++++++++++++++++
Type:Vehicle
Sold: 809000 units
Year: 2016
++++++++++++++++++++++++++++++++++++
Type:Vehicle
Sold: 815000 units
Year: 2015
++++++++++++++++++++++++++++++++++++
Type:Vehicle
Sold: 836000 units
Year: 2014
++++++++++++++++++++++++++++++++++++
Type:Vehicle
Sold: 807000 units
Year: 2013
++++++++++++++++++++++++++++++++++++
Type:Vehicle
Sold: 760000 units
Year: 2012
++++++++++++++++++++++++++++++++++++
Type:Vehicle
Sold: 805000 units
Year: 2011
++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++
Output
++++++++++++++++++++++++++++++++++++
Type:Vehicle
Sold: 815000 units
Year: 2015
++++++++++++++++++++++++++++++++++++
Type:Vehicle
Sold: 836000 units
Year: 2014
++++++++++++++++++++++++++++++++++++
Type:Vehicle
Sold: 807000 units
Year: 2013
++++++++++++++++++++++++++++++++++++
Type:Vehicle
Sold: 760000 units
Year: 2012
++++++++++++++++++++++++++++++++++++
Type:Vehicle
Sold: 805000 units
Year: 2011
++++++++++++++++++++++++++++++++++++
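The positional `split` in the loop above assumes every entry has exactly four tokens, so an entry such as "Vehicle 1,140,000 units /year" would report `year` as its "Year". As a hedged alternative, the sketch below pulls the same fields out with named groups and treats the year as optional. The pattern `ENTRY_RE` and the helper `parse_entry` are hypothetical names introduced here, and the pattern assumes entries shaped like the ones in the question.

```python
import re

# Assumed entry shape: a type word, a comma-grouped count, the word "units",
# and an optional "( YYYY )" suffix.
ENTRY_RE = re.compile(
    r'(?P<type>\w+)\s+(?P<count>[\d,]+)\s+units'
    r'(?:\s*\(\s*(?P<year>\d{4})\s*\))?'
)

def parse_entry(text):
    """Return a dict with type, count and year, or None if the text does not match."""
    match = ENTRY_RE.search(text)
    if match is None:
        return None
    year = match.group('year')
    return {
        'type': match.group('type'),
        # Drop the thousands separators before converting to int.
        'count': int(match.group('count').replace(',', '')),
        # The year is None when the entry has no "( YYYY )" part.
        'year': int(year) if year else None,
    }

print(parse_entry("Vehicle 809,000 units ( 2016 )"))
print(parse_entry("Vehicle 1,140,000 units /year"))
```

Each field can then be read by name (for example `parse_entry(text)['count']`) rather than by list position.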
Source: https://stackoverflow.com/questions/50710290/python-how-to-access-and-iterate-over-a-list-of-div-class-element-using-beauti