问题
The code below takes an xml files and parses it into csv file.
import openpyxl
from bs4 import BeautifulSoup
with open('1last.xml') as f_input:
soup = BeautifulSoup(f_input, 'lxml')
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Sheet1"
ws.append(["Description", "num", "text"])
for description in soup.find_all("description"):
ws.append(["", description['num'], description.text])
ws.append(["SetData", "x", "value", "xin", "xax"])
for setdata in soup.find_all("setdata"):
ws.append(["", setdata.get('x', ''), setdata.get('value', ''), setdata.get('xin', ''), setdata.get('xax', '')])
wb.save(filename="1last.csv")
This is output
This is the XML file
<?xml version="1.0" encoding="utf-8"?>
<ProjectData>
<FINAL>
<START id="ID0001" service_code="0x5196">
<Docs Docs_type="START">
<Rational>225196</Rational>
<Qualify>6251960000A0DE</Qualify>
</Docs>
<Description num="1213f2312">The parameter</Description>
<DataFile dg="12" dg_id="let">
<SetData value="32" />
</DataFile>
</START>
<START id="DG0003" service_code="0x517B">
<Docs Docs_type="START">
<Rational>23423</Rational>
<Qualify>342342</Qualify>
</Docs>
<Description num="3423423f3423">The third</Description>
<DataFile dg="55" dg_id="big">
<SetData x="E1" value="21259" />
<SetData x="E2" value="02" />
</DataFile>
</START>
<START id="ID0048" service_code="0x5198">
<RawData rawdata_type="OPDATA">
<Rational>225198</Rational>
<Qualify>343243324234234</Qualify>
</RawData>
<Description num="434234234">The forth</Description>
<DataFile unit="21" unit_id="FEDS">
<Ycross unit="ce" points="21" name="Thefiles" text_id="54" unit_id="98"
<SetData xin="5" xax="233" value="323" />
<SetData xin="123" xax="77" value="555" />
<SetData xin="17" xax="65" value="23" />
</SetData>
</START>
</FINAL>
</ProjectData>
Recently I have been trying to modify the code so it goes through all the children of START and parse them into columns. If one child element has more rows, it will parse into a new line just as what the code above does. Unfortunately, not successful and just stuck at this moment
This picture shows on how the output should look like.
回答1:
You can try something like this.
I have written the code for only a few of the tags. You can easily fill up the rest of the required tags similarly. Hope it helps!
Edited to add the set data tag values.
from xml.etree import ElementTree as ET
from collections import defaultdict
import csv
tree = ET.parse(StringIO(data))
root = tree.getroot()
with open('output.csv', 'w', newline='') as f:
writer = csv.writer(f)
start_nodes = root.findall('.//START')
headers = ['id', 'service_code', 'rational', 'qualify', 'description_num', 'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value', 'set_data_x']
writer.writerow(headers)
for sn in start_nodes:
row = defaultdict(str)
for k,v in sn.attrib.items():
row[k] = v
for rn in sn.findall('.//Rational'):
row['rational'] = rn.text
for qu in sn.findall('.//Qualify'):
row['qualify'] = qu.text
for ds in sn.findall('.//Description'):
row['description_txt'] = ds.text
row['description_num'] = ds.attrib['num']
# all other tags except set data must be parsed before this.
for st in sn.findall('.//SetData'):
for k,v in st.attrib.items():
row['set_data_'+ str(k)] = v
row_data = [row[i] for i in headers]
writer.writerow(row_data)
row = defaultdict(str)
Update
add
for st in sn.findall('.//DataFile'):
for k,v in st.attrib.items():
row['datafile_'+ str(k)] = v
for st in sn.findall('.//Ycross'):
for k,v in st.attrib.items():
row['ycross_'+ str(k)] = v
and the corresponding values to the headers list
来源:https://stackoverflow.com/questions/59514690/beautifulsoup-xml-to-csv