Question
I'm back with another issue. I'm using BeautifulSoup and am really new to parsing XML, and I've had this problem for two weeks now. I'd appreciate your help. I have this structure:
<detail>
<page number="01">
<Bloc code="AF" A="000000000002550" B="000000000002550"/>
<Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
<Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
</page>
<page number="02">
<Bloc code="DA" A="000000000038486" B="000000000038486"/>
<Bloc code="DD" A="000000000003849" B="000000000003849"/>
<Bloc code="EA" A="000000000001029"/>
<Bloc code="EC" A="000000000063797" B="000000000082427"/>
</page>
<page number="03">
<Bloc code="FD" C="000000000574042" D="000000000610740"/>
<Bloc code="GW" C="000000000052677" D="000000000075362"/>
</page>
</detail>
This is my code (I know it's poor and needs improving):
if soup.find_all('bloc') != None:
    for element in soup.find_all('bloc'):
        code_element = element['code']
        if element.find('m1'):
            m1_element = element['m1']
        else:
            None
        if element.find('m2'):
            m2_element = element['m2']
        else:
            None
        print(code_element,m1_element, m2_element)
I get an error because the 'm2' element does not exist in all the pages. I don't know how to handle this issue.
I would like to put the result in a DataFrame like this:
DataFrame columns:
CODE  A           B             C           D           Page
AF    0000002550  00002550      NULL        NULL        01
AH    000035826   NULL          000035826   0000035826  01
AR    000026935   000000024503  0000002431  0000001669  01
....etc.
Thank you so much for your help
Answer 1:
The core is a list comprehension over the bloc elements with an embedded dict comprehension over each bloc's attributes. The page number is added by appending to the dict of bloc attributes: navigate to the parent element and read its number attribute.
Column order follows the order in which the attributes are first seen.
import pandas as pd
from bs4 import BeautifulSoup
xml = """<detail>
<page number="01">
<Bloc code="AF" A="000000000002550" B="000000000002550"/>
<Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
<Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
</page>
<page number="02">
<Bloc code="DA" A="000000000038486" B="000000000038486"/>
<Bloc code="DD" A="000000000003849" B="000000000003849"/>
<Bloc code="EA" A="000000000001029"/>
<Bloc code="EC" A="000000000063797" B="000000000082427"/>
</page>
<page number="03">
<Bloc code="FD" C="000000000574042" D="000000000610740"/>
<Bloc code="GW" C="000000000052677" D="000000000075362"/>
</page>
</detail>"""
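# An HTML parser lowercases tag and attribute names, so "Bloc" is found as "bloc"
soup = BeautifulSoup(xml, "html.parser")
# One dict per bloc: all of its attributes plus the parent page's number
df = pd.DataFrame([{**{k:b[k] for k in b.attrs.keys()},**{"page":b.parent["number"]}}
                   for b in soup.find_all("bloc")])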
output
code a b page c d
AF 000000000002550 000000000002550 01 NaN NaN
AH 000000000035826 NaN 01 000000000035826 000000000035826
AR 000000000026935 000000000024503 01 000000000002431 000000000001669
DA 000000000038486 000000000038486 02 NaN NaN
DD 000000000003849 000000000003849 02 NaN NaN
EA 000000000001029 NaN 02 NaN NaN
EC 000000000063797 000000000082427 02 NaN NaN
FD NaN NaN 03 000000000574042 000000000610740
GW NaN NaN 03 000000000052677 000000000075362
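The error in the question comes from indexing an attribute that is not there: element['m2'] raises KeyError when the attribute is missing, and element.find('m2') searches for child tags, not attributes. As a minimal sketch of the same idea written as an explicit loop, Tag.get() returns None for a missing attribute. The shortened sample, the columns list, and the rows variable are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (assumption: same structure as above)
xml = """<detail>
<page number="01">
<Bloc code="AF" A="000000000002550" B="000000000002550"/>
<Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
</page>
<page number="03">
<Bloc code="FD" C="000000000574042" D="000000000610740"/>
</page>
</detail>"""

# html.parser lowercases names: "Bloc" becomes "bloc", "A" becomes "a"
soup = BeautifulSoup(xml, "html.parser")

# Hypothetical fixed list of the attributes we expect on a bloc
columns = ["code", "a", "b", "c", "d"]

rows = []
for bloc in soup.find_all("bloc"):
    # Tag.get returns None for a missing attribute instead of raising KeyError
    row = {name: bloc.get(name) for name in columns}
    # The page number lives on the parent <page> element
    row["page"] = bloc.parent["number"]
    rows.append(row)

for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) should give one column per key, with the None values shown as missing.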
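ElementTree
Very similar to the BeautifulSoup version. ElementTree is a real XML parser, so tag and attribute names keep their original case (Bloc, A, B, C, D).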
import xml.etree.ElementTree as ET
root = ET.fromstring(xml)
df2 = pd.DataFrame([{**b.attrib, **{"page":p.attrib["number"]}}
                    for p in root.iter("page")
                    for b in p.iter("Bloc")])
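If the columns should come out in the order the question asks for (code, A, B, C, D, then the page) rather than in first-seen order, one option is to build each row from a fixed list of names. This is only a sketch using the standard library; the columns list, the "Page" key, and the shortened sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document (assumption: same structure as above)
xml = """<detail>
<page number="01">
<Bloc code="AF" A="000000000002550" B="000000000002550"/>
<Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
</page>
<page number="02">
<Bloc code="EA" A="000000000001029"/>
</page>
</detail>"""

root = ET.fromstring(xml)

# Hypothetical fixed column order, following the layout asked for in the question
columns = ["code", "A", "B", "C", "D"]

rows = []
for page in root.iter("page"):
    for bloc in page.iter("Bloc"):
        # dict.get returns None when a bloc lacks that attribute
        row = {name: bloc.attrib.get(name) for name in columns}
        row["Page"] = page.attrib["number"]
        rows.append(row)

for row in rows:
    print(row)
```

Because every row has the same keys in the same order, pd.DataFrame(rows) keeps that column order. For the existing df2, df2.reindex(columns=["code", "A", "B", "C", "D", "page"]) would be another way to reorder.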
Source: https://stackoverflow.com/questions/65748393/beautifulsoup-parsing-xml-to-table