BeautifulSoup XML to CSV

Submitted by 核能气质少年 on 2020-01-05 07:14:55

Question


The code below takes an XML file and parses it into a CSV file.

import openpyxl    
from bs4 import BeautifulSoup


with open('1last.xml') as f_input:
    soup = BeautifulSoup(f_input, 'lxml')

wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Sheet1"

ws.append(["Description", "num", "text"])

for description in soup.find_all("description"):
    ws.append(["", description['num'], description.text])

ws.append(["SetData", "x", "value", "xin", "xax"])

for setdata in soup.find_all("setdata"):
    ws.append(["", setdata.get('x', ''), setdata.get('value', ''), setdata.get('xin', ''), setdata.get('xax', '')])

wb.save(filename="1last.csv")

This is the output (image not preserved).

This is the XML file:

<?xml version="1.0" encoding="utf-8"?>
<ProjectData>
<FINAL>
    <START id="ID0001" service_code="0x5196">
      <Docs Docs_type="START">
        <Rational>225196</Rational>
        <Qualify>6251960000A0DE</Qualify>
      </Docs>
      <Description num="1213f2312">The parameter</Description>
      <DataFile dg="12" dg_id="let">
        <SetData value="32" />
      </DataFile>
    </START>
    <START id="DG0003" service_code="0x517B">
      <Docs Docs_type="START">
        <Rational>23423</Rational>
        <Qualify>342342</Qualify>
      </Docs>
      <Description num="3423423f3423">The third</Description>
      <DataFile dg="55" dg_id="big">
        <SetData x="E1" value="21259" />
        <SetData x="E2" value="02" />
      </DataFile>
    </START>
    <START id="ID0048" service_code="0x5198">
      <RawData rawdata_type="OPDATA">
        <Rational>225198</Rational>
        <Qualify>343243324234234</Qualify>
      </RawData>
      <Description num="434234234">The forth</Description>
      <DataFile unit="21" unit_id="FEDS">
        <Ycross unit="ce" points="21" name="Thefiles" text_id="54" unit_id="98" />
        <SetData xin="5" xax="233" value="323" />
        <SetData xin="123" xax="77" value="555" />
        <SetData xin="17" xax="65" value="23" />
      </DataFile>
    </START>
</FINAL>
</ProjectData>

Recently I have been trying to modify the code so that it goes through all the children of START and parses them into columns. If a child element has more than one row, each additional row should go on a new line, just as the code above does. Unfortunately I have not succeeded and am stuck at this point.

This picture shows how the output should look.


Answer 1:


You can try something like this.

I have written the code for only a few of the tags. You can fill in the remaining tags in the same way. Hope it helps!

Edited to add the SetData attribute values.

import csv
from collections import defaultdict
from io import StringIO
from xml.etree import ElementTree as ET

# `data` is assumed to hold the XML document as a string.
tree = ET.parse(StringIO(data))
root = tree.getroot()

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)

    start_nodes = root.findall('.//START')
    headers = ['id', 'service_code', 'rational', 'qualify', 'description_num', 'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value', 'set_data_x']
    writer.writerow(headers)
    for sn in start_nodes:
        row = defaultdict(str)

        for k,v in sn.attrib.items():
            row[k] = v

        for rn in sn.findall('.//Rational'):
            row['rational'] = rn.text

        for qu in sn.findall('.//Qualify'):
            row['qualify'] = qu.text

        for ds in sn.findall('.//Description'):
            row['description_txt'] = ds.text
            row['description_num'] = ds.attrib['num']

        # all other tags except set data must be parsed before this.
        for st in sn.findall('.//SetData'):
            for k,v in st.attrib.items():
                row['set_data_'+ str(k)] = v
            row_data = [row[i] for i in headers]
            writer.writerow(row_data)
            row = defaultdict(str)
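
For reference, here is a self-contained sketch of the same idea that runs without any external file. It is not the exact code above: SAMPLE is a trimmed copy of the XML from the question, and HEADERS and start_rows are names introduced only for this illustration. It writes one row per SetData element and, like the code above, fills the START-level columns only on the first row of each START.

```python
import csv
import io
from xml.etree import ElementTree as ET

# Hypothetical, trimmed-down sample modelled on the XML in the question.
SAMPLE = """<ProjectData>
<FINAL>
  <START id="ID0001" service_code="0x5196">
    <Docs Docs_type="START">
      <Rational>225196</Rational>
      <Qualify>6251960000A0DE</Qualify>
    </Docs>
    <Description num="1213f2312">The parameter</Description>
    <DataFile dg="12" dg_id="let">
      <SetData value="32" />
    </DataFile>
  </START>
  <START id="DG0003" service_code="0x517B">
    <Docs Docs_type="START">
      <Rational>23423</Rational>
      <Qualify>342342</Qualify>
    </Docs>
    <Description num="3423423f3423">The third</Description>
    <DataFile dg="55" dg_id="big">
      <SetData x="E1" value="21259" />
      <SetData x="E2" value="02" />
    </DataFile>
  </START>
</FINAL>
</ProjectData>"""

# Column order for the output (illustrative subset of the answer's headers).
HEADERS = ['id', 'service_code', 'rational', 'qualify',
           'description_num', 'description_txt',
           'set_data_x', 'set_data_value']


def start_rows(xml_text):
    """Return one list per SetData element, in HEADERS order."""
    root = ET.fromstring(xml_text)
    rows = []
    for sn in root.findall('.//START'):
        row = dict(sn.attrib)  # id, service_code
        row['rational'] = sn.findtext('.//Rational', default='')
        row['qualify'] = sn.findtext('.//Qualify', default='')
        ds = sn.find('.//Description')
        if ds is not None:
            row['description_num'] = ds.get('num', '')
            row['description_txt'] = ds.text or ''
        for st in sn.findall('.//SetData'):
            for k, v in st.attrib.items():
                row['set_data_' + k] = v
            rows.append([row.get(h, '') for h in HEADERS])
            row = {}  # later SetData rows carry only the set_data_* columns
    return rows


# Write to an in-memory buffer; swap in open('output.csv', 'w', newline='')
# to produce a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(HEADERS)
writer.writerows(start_rows(SAMPLE))
csv_text = buffer.getvalue()
print(csv_text)
```

Note that, as in the code above, a START element with no SetData children produces no row at all; if that case can occur in your data, you would need to write the row once after the inner loop.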

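Update

To capture the DataFile and Ycross attributes as well, add the following inside the START loop, before the SetData loop: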

        for st in sn.findall('.//DataFile'):
            for k,v in st.attrib.items():
                row['datafile_'+ str(k)] = v 

        for st in sn.findall('.//Ycross'):
            for k,v in st.attrib.items():
                row['ycross_'+ str(k)] = v 

and add the corresponding keys (for example datafile_dg or ycross_unit) to the headers list.
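
If maintaining the headers list by hand becomes tedious, one possible variation is to collect the prefixed keys first and derive the headers from whatever was actually seen. The sketch below assumes that approach; SAMPLE is a shortened fragment modelled on the third START element, and PREFIXES and prefixed_attrs are hypothetical names, not part of the code above. It uses csv.DictWriter with restval='' so that missing columns come out as empty cells.

```python
import csv
import io
from xml.etree import ElementTree as ET

# Hypothetical fragment modelled on the third START element in the question.
SAMPLE = """<FINAL>
  <START id="ID0048" service_code="0x5198">
    <DataFile unit="21" unit_id="FEDS">
      <Ycross unit="ce" points="21" />
      <SetData xin="5" xax="233" value="323" />
      <SetData xin="123" xax="77" value="555" />
    </DataFile>
  </START>
</FINAL>"""

# Tag name -> column prefix; extend this mapping for further tags.
PREFIXES = {'DataFile': 'datafile_', 'Ycross': 'ycross_'}


def prefixed_attrs(start_node):
    """Collect attributes of the mapped tags under one START, with prefixes."""
    out = {}
    for tag, prefix in PREFIXES.items():
        for node in start_node.findall('.//' + tag):
            for k, v in node.attrib.items():
                out[prefix + k] = v
    return out


root = ET.fromstring(SAMPLE)
records = []
for sn in root.findall('.//START'):
    row = dict(sn.attrib)
    row.update(prefixed_attrs(sn))
    for st in sn.findall('.//SetData'):
        row.update({'set_data_' + k: v for k, v in st.attrib.items()})
        records.append(row)
        row = {}  # later SetData rows carry only the set_data_* columns

# Build the header list from the keys actually seen, in first-seen order.
headers = list(dict.fromkeys(k for rec in records for k in rec))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=headers, restval='',
                        lineterminator='\n')
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

The trade-off is that the column order now depends on the input rather than on a fixed list, so a hand-written headers list remains the better choice when the output layout must be stable.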



Source: https://stackoverflow.com/questions/59514690/beautifulsoup-xml-to-csv
