How to split an XML file (on specific N node) while conserving a header and a footer with Python? [closed]

Question

I am new to Python and I don't know where to start with the solution of my problem.

Here is what I need to do: read an XML file from a folder and split it into multiple XML files (in another folder) based on a specific repeating node (which the user provides as input), while keeping the header (what comes before that node) and the footer (what comes after that node).

Here is an example:

<?xml version="1.0"?>
<catalog catalogName="cat1" catalogType="bestsellers">
   <headerNode node="1">
      <param1>value1</param1>
      <param2>value2</param2>
   </headerNode>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book>
   <footerNode node="2">
      <param1>value1</param1>
      <param2>value2</param2>
   </footerNode>
</catalog>

So the goal would be to have 3 XML files (because we have 3 instances of the "book" node), each containing the "headerNode" + 1 "book" + the "footerNode".

The first file would be like this:

<?xml version="1.0"?>
<catalog catalogName="cat1" catalogType="bestsellers">
   <headerNode node="1">
      <param1>value1</param1>
      <param2>value2</param2>
   </headerNode>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <footerNode node="2">
      <param1>value1</param1>
      <param2>value2</param2>
   </footerNode>
</catalog>

The only constraint is that it needs to be done with "ElementTree" and not "lxml" library (because lxml is not included in the production dependencies).

EDIT: So here is the code based on the answer from "MK Ultra".

For now I have modified it to pass two parameters to the script (the first is the name of the XML file without its extension, the second is the split node), and I now read the XML and generate the XML files in the same folder as the script. (I also use an index in the loop to name the output files.)

import sys
import xml.etree.ElementTree as ET
import os

# Get the current directory
cwd = os.getcwd()
# Load the xml
doc = ET.parse(r"%s/%s.xml" % (cwd,sys.argv[1]))
root = doc.getroot()
# Get the header element
header = root.find("headerNode")
# Get the footer element
footer = root.find("footerNode")
# loop over the books and create the new xml file
for idx,book in enumerate(root.findall(sys.argv[2])):
    top = ET.Element(root.tag)
    top.append(header)
    top.append(book)
    top.append(footer)
    out_book = ET.ElementTree(top)
    # the output file name is the input file name plus the index of the node
    out_path = "%s/%s_%s.xml" % (cwd,sys.argv[1],idx)
    # passing the path lets write() open and close the file itself
    out_book.write(out_path)

How can I make the "headerNode"/"footerNode" part generic? By this I mean that the split node could be "book" or something else like "novel", "paper", etc. The correct value would only be known by the user of the script (who is obviously not me) when running it.

EDIT2: I just modified the original file to add attributes to the "catalog" node, because I cannot copy the attributes when creating the split files.

Answer 1:

The algorithm goes as follows:

1. Parse your XML file and get the existing root.
2. From that, build the base for all books, new_root, which holds the catalog with its header and footer.
3. Iterate over the children of root to find every element with the tag 'book'.
4. Insert each book element into new_root and write it to a file. Here I've named each file after the element's id.

import os
from xml.etree.ElementTree import ElementTree, parse, Element

# question 2 - tag name as input from the user
# (input() is Python 3; on Python 2 use raw_input() instead)
tag_name = input("Enter tag name:")
root = parse('sample.xml').getroot()
new_root = Element(root.tag)
# question 1 - multiple headers and footers
headers = root.findall('.//headerNode')
new_root.extend(headers)
new_root.extend(root.findall('.//footerNode'))
for elem in root:
    if elem.tag == tag_name:
        # place the element right after the header nodes
        new_root.insert(len(headers), elem)
        # question 3 - write the output to a file named after the element's id
        out_path = os.path.join('path/to/folder', elem.get('id') + '.xml')
        ElementTree(new_root).write(out_path)
        new_root.remove(elem)

Sample Output:

File Name: bk101.xml

<catalog>
   <headerNode node="1">
      <param1>value1</param1>
      <param2>value2</param2>
   </headerNode>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <footerNode node="2">
      <param1>value1</param1>
      <param2>value2</param2>
   </footerNode>
</catalog>
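
Note that the sample output shows <catalog> without its catalogName and catalogType attributes, and without the <?xml ...?> line, because Element(root.tag) copies only the tag name and write() omits the declaration by default. A minimal sketch of one way to keep both, assuming the standard-library ElementTree only; SAMPLE is a shortened stand-in for the question's file, and the variable names are just illustrative:

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's catalog (an assumption for this sketch).
SAMPLE = """<catalog catalogName="cat1" catalogType="bestsellers">
   <headerNode node="1"><param1>value1</param1></headerNode>
   <book id="bk101"><title>XML Developer's Guide</title></book>
   <footerNode node="2"><param1>value1</param1></footerNode>
</catalog>"""

root = ET.fromstring(SAMPLE)

# The second argument of Element() is the attribute dict; dict() takes a copy
# so the new root does not share the original's attributes.
new_root = ET.Element(root.tag, dict(root.attrib))
new_root.extend(root.findall("headerNode"))
new_root.extend(root.findall("book"))
new_root.extend(root.findall("footerNode"))

# xml_declaration=True makes write() emit the <?xml ...?> line as well.
buffer = io.BytesIO()
ET.ElementTree(new_root).write(buffer, encoding="utf-8", xml_declaration=True)
xml_text = buffer.getvalue().decode("utf-8")
print(xml_text)
```

The same two arguments (encoding and xml_declaration) can be passed when writing to a file path instead of an in-memory buffer.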

Happy Coding!

Answer 2:

Off the top of my head, you can do something like this:

import xml.etree.ElementTree as ET

# Load the xml
doc = ET.parse(r"d:\books.xml")
root = doc.getroot()
# Get the header element
header = root.find("headerNode")
# Get the footer element
footer = root.find("footerNode")
# loop over the books and create the new xml file
for book in root.findall('book'):
    top = ET.Element(root.tag)
    top.append(header)
    top.append(book)
    top.append(footer)
    out_book = ET.ElementTree(top)
    # the output file name will be the ID of the book
    out_path = "%s.xml" % book.attrib["id"]
    # passing the path lets write() open and close the file itself
    out_book.write(out_path)
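
For the follow-up about making the header and footer generic, one possible approach is to avoid naming them at all: treat every child of the root that comes before the first split node as the header, and every child after the last split node as the footer. The sketch below is only an illustration under that assumption (children sitting between two split nodes are skipped); split_xml, SAMPLE, books.xml and the out folder are hypothetical names, and it uses ElementTree only, not lxml.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def split_xml(src_path, split_tag, out_dir):
    """Write one file per <split_tag> child of the root and return the paths.

    Assumption: children before the first split node form the header and
    children after the last split node form the footer.
    """
    root = ET.parse(src_path).getroot()
    children = list(root)
    positions = [i for i, child in enumerate(children) if child.tag == split_tag]
    if not positions:
        raise ValueError("no <%s> child under <%s>" % (split_tag, root.tag))

    header = children[:positions[0]]
    footer = children[positions[-1] + 1:]

    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(src_path))[0]
    written = []
    for idx, pos in enumerate(positions):
        # Copy the tag and the attributes of the original root.
        top = ET.Element(root.tag, dict(root.attrib))
        top.text = root.text  # keep the whitespace before the first child
        top.extend(header)
        top.append(children[pos])
        top.extend(footer)
        out_path = os.path.join(out_dir, "%s_%d.xml" % (base, idx))
        ET.ElementTree(top).write(out_path, encoding="utf-8", xml_declaration=True)
        written.append(out_path)
    return written


# Shortened stand-in for the question's catalog (an assumption for this sketch).
SAMPLE = """<?xml version="1.0"?>
<catalog catalogName="cat1" catalogType="bestsellers">
   <headerNode node="1"><param1>value1</param1></headerNode>
   <book id="bk101"><title>XML Developer's Guide</title></book>
   <book id="bk102"><title>Midnight Rain</title></book>
   <footerNode node="2"><param1>value1</param1></footerNode>
</catalog>
"""

work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "books.xml")
with open(src, "w", encoding="utf-8") as handle:
    handle.write(SAMPLE)

paths = split_xml(src, "book", os.path.join(work_dir, "out"))
for path in paths:
    print(path)
```

The split tag could just as well come from sys.argv, as in the question's script. Reusing the same header and footer elements in every output tree works because ElementTree elements do not keep a reference to their parent.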

Source: https://stackoverflow.com/questions/43436086/how-to-split-an-xml-file-on-specific-n-node-while-conserving-a-header-and-a-fo

Tags

python

xml

split

nodes