BeautifulSoup parsing XML to table

时光怂恿深爱的人放手 submitted on 2021-02-11 14:34:03

Question


I'm back with another issue. I'm really new to parsing XML with BeautifulSoup and have been stuck on this problem for two weeks now, so I would appreciate your help. I have this structure:

<detail>
<page number="01">
    <Bloc code="AF" A="000000000002550" B="000000000002550"/>
    <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
    <Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
</page>
<page number="02">
    <Bloc code="DA" A="000000000038486" B="000000000038486"/>
    <Bloc code="DD" A="000000000003849" B="000000000003849"/>
    <Bloc code="EA" A="000000000001029"/>
    <Bloc code="EC" A="000000000063797" B="000000000082427"/>
</page>
    <page number="03">
    <Bloc code="FD" C="000000000574042" D="000000000610740"/>
    <Bloc code="GW" C="000000000052677" D="000000000075362"/>
</page>
</detail>

This is my code (I know it is poor and needs improvement):

if soup.find_all('bloc') != None:
    for element in soup.find_all('bloc'):
        code_element = element['code']
        if element.find('m1'):
            m1_element  = element['m1']
        else:
            None
        if element.find('m2'):
            m2_element  = element['m2']
        else:
            None
        print(code_element,m1_element, m2_element)

I get an error because the 'm2' element does not exist on all the pages. I don't know how to handle this issue.

I would like to put the result in a DataFrame like this:

CODE   A           B             C           D           Page
AF     0000002550  00002550      NULL        NULL        01
AH     000035826   NULL          000035826   0000035826  01
AR     000026935   000000024503  0000002431  0000001669  01
....etc.

Thank you so much for your help


Answer 1:


The core is a list comprehension over the Bloc elements with an embedded dict comprehension over each element's attributes. The page number is added to that dict by navigating to the element's parent and reading its number attribute.

Column order follows the order in which the attributes are first seen. Rows that lack an attribute get NaN in that column.

import pandas as pd
from bs4 import BeautifulSoup
xml = """<detail>
<page number="01">
    <Bloc code="AF" A="000000000002550" B="000000000002550"/>
    <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
    <Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
</page>
<page number="02">
    <Bloc code="DA" A="000000000038486" B="000000000038486"/>
    <Bloc code="DD" A="000000000003849" B="000000000003849"/>
    <Bloc code="EA" A="000000000001029"/>
    <Bloc code="EC" A="000000000063797" B="000000000082427"/>
</page>
    <page number="03">
    <Bloc code="FD" C="000000000574042" D="000000000610740"/>
    <Bloc code="GW" C="000000000052677" D="000000000075362"/>
</page>
</detail>"""

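# html.parser lowercases tag and attribute names, hence "bloc" below
soup = BeautifulSoup(xml, "html.parser")
# one dict per Bloc: its own attributes plus the parent page's number
df = pd.DataFrame([{**{k:b[k] for k in b.attrs.keys()},**{"page":b.parent["number"]}}
                   for b in soup.find_all("bloc")])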


output

code               a               b page               c               d
  AF 000000000002550 000000000002550   01             NaN             NaN
  AH 000000000035826             NaN   01 000000000035826 000000000035826
  AR 000000000026935 000000000024503   01 000000000002431 000000000001669
  DA 000000000038486 000000000038486   02             NaN             NaN
  DD 000000000003849 000000000003849   02             NaN             NaN
  EA 000000000001029             NaN   02             NaN             NaN
  EC 000000000063797 000000000082427   02             NaN             NaN
  FD             NaN             NaN   03 000000000574042 000000000610740
  GW             NaN             NaN   03 000000000052677 000000000075362
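For the error described in the question, a more explicit loop may be easier to follow than the comprehension. The sketch below is only an illustration: it assumes the built-in html.parser (which lowercases tag and attribute names), uses a shortened copy of the sample XML, and introduces a hypothetical helper named bloc_rows. It relies on Tag.get, which returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Shortened sample, for illustration only
sample = """<detail>
<page number="01">
    <Bloc code="AF" A="000000000002550" B="000000000002550"/>
    <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
</page>
<page number="03">
    <Bloc code="FD" C="000000000574042" D="000000000610740"/>
</page>
</detail>"""


def bloc_rows(markup):
    """Return one dict per Bloc element; missing attributes become None."""
    # html.parser lowercases names, so we look for "bloc" and "a".."d"
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for bloc in soup.find_all("bloc"):
        row = {"code": bloc.get("code")}
        for name in ("a", "b", "c", "d"):
            row[name] = bloc.get(name)  # None when the attribute is absent
        row["page"] = bloc.parent.get("number")
        rows.append(row)
    return rows


rows = bloc_rows(sample)
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame should then give a frame with the columns in the fixed order code, a, b, c, d, page, with None wherever an attribute is missing.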

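ElementTree

The standard library's xml.etree.ElementTree works in much the same way as the BeautifulSoup version. It keeps the original case of tag and attribute names, so the element is matched as "Bloc".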

import xml.etree.ElementTree as ET
root = ET.fromstring(xml)
df2 = pd.DataFrame([{**b.attrib, **{"page":p.attrib["number"]}} 
                    for p in root.iter("page") 
                    for b in p.iter("Bloc") ])
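Because the columns appear in the order they are first seen, page ends up between B and C. If the layout from the question (code, A, B, C, D, page) is wanted, one option is to reorder the columns afterwards. The following is a minimal sketch under the assumption that the attribute names are known in advance; it uses a shortened copy of the sample XML and DataFrame.reindex.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened sample, for illustration only
sample = """<detail>
<page number="01">
    <Bloc code="AF" A="000000000002550" B="000000000002550"/>
    <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
</page>
<page number="03">
    <Bloc code="FD" C="000000000574042" D="000000000610740"/>
</page>
</detail>"""

root = ET.fromstring(sample)

# One dict per Bloc: its attributes plus the number of the enclosing page
df = pd.DataFrame([{**b.attrib, "page": p.get("number")}
                   for p in root.iter("page")
                   for b in p.iter("Bloc")])

# Put the columns in a fixed order; a listed name that never occurs
# in the data would simply become a column of NaN
df = df.reindex(columns=["code", "A", "B", "C", "D", "page"])

print(df.to_string(index=False))
```

The values stay as strings, so the leading zeros are preserved.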



Source: https://stackoverflow.com/questions/65748393/beautifulsoup-parsing-xml-to-table
