BeautifulSoup XML to CSV

Question

The code below takes an XML file and parses it into a CSV file.

import openpyxl    
from bs4 import BeautifulSoup


with open('1last.xml') as f_input:
    soup = BeautifulSoup(f_input, 'lxml')

wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Sheet1"

ws.append(["Description", "num", "text"])

for description in soup.find_all("description"):
    ws.append(["", description['num'], description.text])

ws.append(["SetData", "x", "value", "xin", "xax"])

for setdata in soup.find_all("setdata"):
    ws.append(["", setdata.get('x', ''), setdata.get('value', ''), setdata.get('xin', ''), setdata.get('xax', '')])

wb.save(filename="1last.csv")

This is the output (image not included).

This is the XML file

<?xml version="1.0" encoding="utf-8"?>
<ProjectData>
<FINAL>
    <START id="ID0001" service_code="0x5196">
      <Docs Docs_type="START">
        <Rational>225196</Rational>
        <Qualify>6251960000A0DE</Qualify>
      </Docs>
      <Description num="1213f2312">The parameter</Description>
      <DataFile dg="12" dg_id="let">
        <SetData value="32" />
      </DataFile>
    </START>
    <START id="DG0003" service_code="0x517B">
      <Docs Docs_type="START">
        <Rational>23423</Rational>
        <Qualify>342342</Qualify>
      </Docs>
      <Description num="3423423f3423">The third</Description>
      <DataFile dg="55" dg_id="big">
        <SetData x="E1" value="21259" />
        <SetData x="E2" value="02" />
      </DataFile>
    </START>
    <START id="ID0048" service_code="0x5198">
      <RawData rawdata_type="OPDATA">
        <Rational>225198</Rational>
        <Qualify>343243324234234</Qualify>
      </RawData>
      <Description num="434234234">The forth</Description>
      <DataFile unit="21" unit_id="FEDS">
        <Ycross unit="ce" points="21" name="Thefiles" text_id="54" unit_id="98" />
        <SetData xin="5" xax="233" value="323" />
        <SetData xin="123" xax="77" value="555" />
        <SetData xin="17" xax="65" value="23" />
      </DataFile>
    </START>
</FINAL>
</ProjectData>

Recently I have been trying to modify the code so that it goes through all the children of START and parses them into columns. If a child element has more than one row, each additional row should go on a new line, just as the code above does. Unfortunately, I have not been successful and am stuck at the moment.

This picture shows how the output should look (image not included).

Answer 1:

You can try something like this.

I have written the code for only a few of the tags. You can fill in the rest of the required tags in the same way. Hope it helps!

Edited to add the set data tag values.

from xml.etree import ElementTree as ET
from collections import defaultdict
from io import StringIO
import csv

# `data` is assumed to hold the XML from the question as a string.
tree = ET.parse(StringIO(data))
root = tree.getroot()

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)

    start_nodes = root.findall('.//START')
    headers = ['id', 'service_code', 'rational', 'qualify', 'description_num', 'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value', 'set_data_x']
    writer.writerow(headers)
    for sn in start_nodes:
        row = defaultdict(str)

        for k,v in sn.attrib.items():
            row[k] = v

        for rn in sn.findall('.//Rational'):
            row['rational'] = rn.text

        for qu in sn.findall('.//Qualify'):
            row['qualify'] = qu.text

        for ds in sn.findall('.//Description'):
            row['description_txt'] = ds.text
            row['description_num'] = ds.attrib['num']

        # all other tags except set data must be parsed before this.
        for st in sn.findall('.//SetData'):
            for k,v in st.attrib.items():
                row['set_data_'+ str(k)] = v
            row_data = [row[i] for i in headers]
            writer.writerow(row_data)
            row = defaultdict(str)
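
One caveat about the loop above: a row is only written inside the SetData loop, so a START element with no SetData children produces no output at all. The following is a minimal sketch of one way to handle that case, not part of the original answer. The helper name rows_for_start, the shortened header list, and the sample XML are all made up for illustration.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Shortened header list, for illustration only.
HEADERS = ['id', 'service_code', 'description_num', 'description_txt',
           'set_data_x', 'set_data_value']


def rows_for_start(sn, headers):
    """Return the CSV rows for one START element (hypothetical helper)."""
    row = defaultdict(str)
    row.update(sn.attrib)
    for ds in sn.findall('.//Description'):
        row['description_txt'] = ds.text or ''
        row['description_num'] = ds.get('num', '')

    set_data = sn.findall('.//SetData')
    if not set_data:
        # No SetData children: still emit the parent fields once.
        return [[row[h] for h in headers]]

    rows = []
    for st in set_data:
        for k, v in st.attrib.items():
            row['set_data_' + k] = v
        rows.append([row[h] for h in headers])
        row = defaultdict(str)  # later rows carry only SetData values
    return rows


# Made-up sample, not the file from the question.
sample = """<FINAL>
  <START id="A1" service_code="0x01">
    <Description num="1">No set data here</Description>
  </START>
  <START id="B2" service_code="0x02">
    <Description num="2">Two values</Description>
    <DataFile><SetData x="E1" value="10" /><SetData x="E2" value="20" /></DataFile>
  </START>
</FINAL>"""

root = ET.fromstring(sample)
all_rows = [r for sn in root.findall('.//START')
            for r in rows_for_start(sn, HEADERS)]
for r in all_rows:
    print(r)
```

Each returned list can be passed directly to writer.writerow. Whether a START without SetData should produce a row is a judgment call that depends on the desired output.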

Update

To also capture the DataFile and Ycross attributes, add the following loops before the SetData loop:

        for st in sn.findall('.//DataFile'):
            for k,v in st.attrib.items():
                row['datafile_'+ str(k)] = v 

        for st in sn.findall('.//Ycross'):
            for k,v in st.attrib.items():
                row['ycross_'+ str(k)] = v

and add the corresponding keys (for example datafile_unit or ycross_points) to the headers list.
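
For reference, here is a rough end-to-end sketch with the update applied. It is an illustration under assumptions rather than the answerer's exact code: the XML is a trimmed sample modelled on the third START in the question, the Ycross element is closed so that the sample is well-formed, and an in-memory buffer stands in for output.csv.

```python
import csv
import io
from collections import defaultdict
from xml.etree import ElementTree as ET

# Trimmed sample modelled on the question's XML; closing Ycross with "/>"
# is an assumption made so that the sample parses.
data = """<ProjectData><FINAL>
  <START id="ID0048" service_code="0x5198">
    <Description num="434234234">The forth</Description>
    <DataFile unit="21" unit_id="FEDS">
      <Ycross unit="ce" points="21" />
      <SetData xin="5" xax="233" value="323" />
      <SetData xin="123" xax="77" value="555" />
    </DataFile>
  </START>
</FINAL></ProjectData>"""

headers = ['id', 'service_code', 'description_num', 'description_txt',
           'datafile_unit', 'datafile_unit_id', 'ycross_unit', 'ycross_points',
           'set_data_xin', 'set_data_xax', 'set_data_value']

root = ET.fromstring(data)
out = io.StringIO()  # stands in for open('output.csv', 'w', newline='')
writer = csv.writer(out, lineterminator='\n')
writer.writerow(headers)

for sn in root.findall('.//START'):
    row = defaultdict(str)
    row.update(sn.attrib)
    for ds in sn.findall('.//Description'):
        row['description_txt'] = ds.text
        row['description_num'] = ds.attrib['num']
    # Prefixed attribute keys for DataFile and Ycross, as in the update.
    for tag, prefix in (('DataFile', 'datafile_'), ('Ycross', 'ycross_')):
        for el in sn.findall('.//' + tag):
            for k, v in el.attrib.items():
                row[prefix + k] = v
    # SetData comes last: each element closes one CSV row.
    for st in sn.findall('.//SetData'):
        for k, v in st.attrib.items():
            row['set_data_' + k] = v
        writer.writerow([row[h] for h in headers])
        row = defaultdict(str)

csv_text = out.getvalue()
print(csv_text)
```

Any attribute whose prefixed key is missing from headers is silently dropped, so the headers list has to be kept in step with the tags being collected.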

Source: https://stackoverflow.com/questions/59514690/beautifulsoup-xml-to-csv

Tags

python

xml

pandas

beautifulsoup