How to remove unwanted tags from XML

Posted by 最后都变了 on 2019-12-23 21:05:48

Question


I have a huge XML document and I want to remove unwanted tags from it. For example:

<orgs>
    <org name="Test1">
        <item>a</item>
        <item>b</item>
    </org>
    <org name="Test2">
        <item>c</item>
        <item>b</item>
        <item>e</item>
    </org>
</orgs>

I want to remove all the <item>b</item> elements from this XML. Which parser API should be used for this, given that the XML is very large, and how can I achieve it?


Answer 1:


One approach would be to use the Document Object Model (DOM). The drawback, as the name suggests, is that it needs to load the entire document into memory, and Java's DOM API is quite memory hungry. The benefit is that you can take advantage of XPath to find the offending nodes.

Take a closer look at the Java API for XML Processing (JAXP) for more details and other APIs.

Step 1: Load the document

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new File("..."));

Step 2: Find the offending nodes

XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression xExpress = xPath.compile("/orgs/org/item[text()='b']");
NodeList nodeList = (NodeList) xExpress.evaluate(doc.getDocumentElement(), XPathConstants.NODESET);

Step 3: Remove offending nodes

Okay, this is not as simple as it should be. Removing a node can leave blank space (a whitespace-only text node) in the document, which would be "nice" to clean up. The following is a simple library method, adapted from code I found online, which removes the specified Node and then removes every text node remaining under its parent. That cleans up the leftover whitespace, but note that it would also drop meaningful text if the parent had mixed content.

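// Needs: java.util.ArrayList, java.util.List, org.w3c.dom.Node, org.w3c.dom.NodeList
// Removes the node (children first), then every text node left under its parent.
public static void removeNode(Node node) {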
    if (node != null) {
        while (node.hasChildNodes()) {
            removeNode(node.getFirstChild());
        }

        Node parent = node.getParentNode();
        if (parent != null) {
            parent.removeChild(node);
            NodeList childNodes = parent.getChildNodes();
            if (childNodes.getLength() > 0) {
                List<Node> lstTextNodes = new ArrayList<Node>(childNodes.getLength());
                for (int index = 0; index < childNodes.getLength(); index++) {
                    Node childNode = childNodes.item(index);
                    if (childNode.getNodeType() == Node.TEXT_NODE) {
                        lstTextNodes.add(childNode);
                    }
                }
                for (Node txtNodes : lstTextNodes) {
                    removeNode(txtNodes);
                }
            }
        }
    }
}

Loop over the offending nodes...

for (int index = 0; index < nodeList.getLength(); index++) {
    Node node = nodeList.item(index);
    removeNode(node);
}

Step 4: Save the result

Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.INDENT, "yes");
tf.setOutputProperty(OutputKeys.METHOD, "xml");
tf.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");

DOMSource domSource = new DOMSource(doc);
StreamResult sr = new StreamResult(System.out);
tf.transform(domSource, sr);

Which outputs something like...

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<orgs>
  <org name="Test1">
    <item>a</item>
  </org>
  <org name="Test2">
    <item>c</item>
    <item>e</item>
  </org>
</orgs>
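
For reference, the four steps can be combined into one self-contained class. This is only a sketch under a few assumptions: the class name DomItemRemover and the method removeMatching are made up for illustration, the XML is passed as a String rather than a File so the example runs without any input file, and the whitespace clean-up is a more conservative variant that removes only a whitespace-only text node directly before each removed element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name is not part of any library.
public class DomItemRemover {

    // Parses the XML, removes every node matched by the XPath expression,
    // and returns the serialized result (without an XML declaration).
    public static String removeMatching(String xml, String xpathExpression) {
        try {
            // Step 1: load the document into memory.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Step 2: find the offending nodes.
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpathExpression, doc, XPathConstants.NODESET);

            // Copy the matches first so removal never depends on the NodeList being live or not.
            List<Node> targets = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                targets.add(hits.item(i));
            }

            // Step 3: remove each match plus a whitespace-only text node directly before it.
            for (Node target : targets) {
                Node parent = target.getParentNode();
                if (parent == null) {
                    continue; // e.g. attribute nodes have no parent
                }
                Node previous = target.getPreviousSibling();
                if (previous != null && previous.getNodeType() == Node.TEXT_NODE
                        && previous.getNodeValue().trim().isEmpty()) {
                    parent.removeChild(previous);
                }
                parent.removeChild(target);
            }

            // Step 4: serialize the modified document.
            Transformer tf = TransformerFactory.newInstance().newTransformer();
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            tf.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XML processing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<orgs><org name=\"Test1\"><item>a</item><item>b</item></org></orgs>";
        System.out.println(removeMatching(xml, "/orgs/org/item[text()='b']"));
    }
}
```

Because only whitespace-only text nodes are touched, this variant should leave real text alone in mixed content, at the cost of possibly leaving a stray blank line in some layouts.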



Answer 2:


The standard way to do this is with XSLT. You need a stylesheet with two rules: an identity rule which copies things unchanged:

<xsl:template match="*">
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

and a second rule that drops the unwanted elements:

<xsl:template match="item[. = 'b']"/>

As with a DOM-based approach, this may cause problems if your document is too big to fit in memory. In XSLT 3.0 you can solve this with streaming. XSLT 3.0 also makes "identity" transformations easier to write, so the entire code now becomes:

<xsl:transform version="3.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:mode streamable="yes" on-no-match="shallow-copy"/>
  <xsl:template match="item[. = 'b']"/>
</xsl:transform>
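
The two XSLT 1.0 rules shown earlier in this answer can be applied from Java through the standard JAXP transformation API. The sketch below assumes the stylesheet is embedded as a string and the input is passed as a String; the class name XsltItemRemover and the method transform are hypothetical. The XSLT processor built into the JDK implements XSLT 1.0 only, so the streaming 3.0 version would need an XSLT 3.0 processor instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class; the name is not part of any library.
public class XsltItemRemover {

    // The copy rule and the "drop" rule from the answer, wrapped in a stylesheet element.
    private static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"*\">"
        + "<xsl:copy><xsl:copy-of select=\"@*\"/><xsl:apply-templates/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"item[. = 'b']\"/>"
        + "</xsl:stylesheet>";

    // Runs the stylesheet over the given XML and returns the result as a string.
    public static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<orgs><org name=\"Test1\"><item>a</item><item>b</item></org></orgs>";
        System.out.println(transform(xml));
    }
}
```

The more specific pattern item[. = 'b'] has a higher default priority than match="*", which is why the empty template wins for the unwanted elements.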



Answer 3:


If your data does not fit into memory, you need a pull parser that does not load the file all at once. If your data fits into memory, there is a very short solution using data projection (a project that I'm affiliated with):

public class RemoveTags {

    public interface Projection {
        @XBDelete("//item[text()='b']")
        void deleteAllItems();
    }

    public static void main(String[] args) throws IOException {
        XBProjector projector = new XBProjector();
        Projection projection = projector.io().file("data.xml").read(Projection.class);
        projection.deleteAllItems();
        projector.io().file("withoutItems.xml").write(projection);
    }

}
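
For the case where the data does not fit into memory, the pull-parser idea could be sketched with StAX, the streaming API that ships with the JDK. This is an illustration rather than a definitive implementation: the class StaxItemFilter and its filter methods are hypothetical names, and the sketch compares the full text content of each candidate element against the unwanted value. Only one candidate element is buffered at a time, so memory use should stay small regardless of file size.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class; the name is not part of any library.
public class StaxItemFilter {

    // Copies the XML from in to out, skipping every element called elementName
    // whose text content equals unwantedText.
    public static void filter(Reader in, Writer out, String elementName, String unwantedText) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            List<XMLEvent> buffer = new ArrayList<>(); // events of the current candidate element
            StringBuilder text = new StringBuilder();  // text content of the current candidate
            int depth = 0;                             // > 0 while inside a candidate element

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (depth == 0) {
                    if (event.isStartElement()
                            && event.asStartElement().getName().getLocalPart().equals(elementName)) {
                        depth = 1;          // start buffering a candidate
                        buffer.add(event);
                    } else {
                        writer.add(event);  // everything else passes straight through
                    }
                    continue;
                }
                buffer.add(event);
                if (event.isStartElement()) {
                    depth++;
                } else if (event.isCharacters()) {
                    text.append(event.asCharacters().getData());
                } else if (event.isEndElement() && --depth == 0) {
                    // Candidate complete: write it out only if its text is not the unwanted value.
                    if (!text.toString().equals(unwantedText)) {
                        for (XMLEvent buffered : buffer) {
                            writer.add(buffered);
                        }
                    }
                    buffer.clear();
                    text.setLength(0);
                }
            }
            writer.flush();
            writer.close();
            reader.close();
        } catch (Exception e) {
            throw new IllegalStateException("Streaming filter failed", e);
        }
    }

    // Convenience overload for small inputs and tests.
    public static String filter(String xml, String elementName, String unwantedText) {
        StringWriter out = new StringWriter();
        filter(new StringReader(xml), out, elementName, unwantedText);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<orgs><org name=\"Test1\"><item>a</item><item>b</item></org></orgs>";
        System.out.println(filter(xml, "item", "b"));
    }
}
```

For a real file, the Reader and Writer overload can be given buffered file streams, for example from java.nio.file.Files.newBufferedReader and newBufferedWriter. Whitespace around a removed element is left in place by this sketch.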


Source: https://stackoverflow.com/questions/27978151/how-to-remove-unwanted-tags-from-xml
