Question
I have a huge XML document and I want to remove unwanted tags from it. For example:
<orgs>
  <org name="Test1">
    <item>a</item>
    <item>b</item>
  </org>
  <org name="Test2">
    <item>c</item>
    <item>b</item>
    <item>e</item>
  </org>
</orgs>
I want to remove all the <item>b</item> elements from this XML. Which parser API should be used for this, given that the XML is very large, and how can I achieve it?
Answer 1:
One approach is to use the Document Object Model (DOM). The drawback, as the name suggests, is that it needs to load the entire document into memory, and Java's DOM API is quite memory hungry. The benefit is that you can take advantage of XPath to find the offending nodes.
Take a closer look at the Java API for XML Processing (JAXP) for more details and other APIs.
Step 1: Load the document
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new File("..."));
Step 2: Find the offending nodes
XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression xExpress = xPath.compile("/orgs/org/item[text()='b']");
NodeList nodeList = (NodeList) xExpress.evaluate(doc.getDocumentElement(), XPathConstants.NODESET);
Step 3: Remove offending nodes
Okay, this is not as simple as it should be. Removing a node can leave blank space in the document, which would be "nice" to clean up. The following is a simple library method, adapted from code I found on the internet, that removes the specified Node and also removes the text nodes (such as leftover whitespace) that remain under its parent:
public static void removeNode(Node node) {
    if (node != null) {
        // Remove the node's own children first
        while (node.hasChildNodes()) {
            removeNode(node.getFirstChild());
        }
        Node parent = node.getParentNode();
        if (parent != null) {
            // Detach the node itself
            parent.removeChild(node);
            // Then remove every text node left directly under the parent.
            // Note: this drops non-blank text as well as whitespace.
            NodeList childNodes = parent.getChildNodes();
            if (childNodes.getLength() > 0) {
                List<Node> lstTextNodes = new ArrayList<Node>(childNodes.getLength());
                for (int index = 0; index < childNodes.getLength(); index++) {
                    Node childNode = childNodes.item(index);
                    if (childNode.getNodeType() == Node.TEXT_NODE) {
                        lstTextNodes.add(childNode);
                    }
                }
                for (Node txtNodes : lstTextNodes) {
                    removeNode(txtNodes);
                }
            }
        }
    }
}
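Because the method above removes every text node under the parent, it would also discard real text in mixed content. A more conservative variant might drop only whitespace-only text nodes. The following is a minimal sketch under that assumption; the class name `NodeRemoval` and method name `removeNodeKeepingText` are hypothetical and not part of the original answer.

```java
import org.w3c.dom.Node;

final class NodeRemoval {

    // Removes the node from its parent, then removes only those sibling
    // text nodes that consist entirely of whitespace.
    static void removeNodeKeepingText(Node node) {
        if (node == null) {
            return;
        }
        Node parent = node.getParentNode();
        if (parent == null) {
            return;
        }
        parent.removeChild(node);

        Node child = parent.getFirstChild();
        while (child != null) {
            // Remember the next sibling before a possible removal
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                parent.removeChild(child);
            }
            child = next;
        }
    }
}
```

Removing a node detaches its whole subtree, so there is no need to delete the children one by one first.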
Loop over the offending nodes...
for (int index = 0; index < nodeList.getLength(); index++) {
    Node node = nodeList.item(index);
    removeNode(node);
}
Step 4: Save the result
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.INDENT, "yes");
tf.setOutputProperty(OutputKeys.METHOD, "xml");
tf.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
DOMSource domSource = new DOMSource(doc);
StreamResult sr = new StreamResult(System.out);
tf.transform(domSource, sr);
Which outputs something like...
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<orgs>
  <org name="Test1">
    <item>a</item>
  </org>
  <org name="Test2">
    <item>c</item>
    <item>e</item>
  </org>
</orgs>
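For reference, the four steps can be combined into one self-contained unit with the imports spelled out. This is only a sketch: the class name `DomItemRemover` and method `removeMatching` are made up for illustration, it works on strings rather than files, and it skips the whitespace clean-up and indentation settings to stay short.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomItemRemover {

    // Parses the XML, removes every node selected by the XPath expression,
    // and returns the serialized document.
    static String removeMatching(String xml, String xpathExpression) throws Exception {
        // Step 1: load the document (the whole tree is held in memory)
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Step 2: find the offending nodes
        NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpathExpression, doc, XPathConstants.NODESET);

        // Step 3: detach each match from its parent
        for (int i = 0; i < matches.getLength(); i++) {
            Node node = matches.item(i);
            Node parent = node.getParentNode();
            if (parent != null) {
                parent.removeChild(node);
            }
        }

        // Step 4: serialize the modified tree
        Transformer tf = TransformerFactory.newInstance().newTransformer();
        tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Calling `DomItemRemover.removeMatching(xml, "/orgs/org/item[text()='b']")` would correspond to the example in the question.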
Answer 2:
The standard way to do this is with XSLT. You need a stylesheet with two rules: an identity rule which copies things unchanged:
<xsl:template match="*">
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>
and a second rule that drops the unwanted elements:
<xsl:template match="item[. = 'b']"/>
As with a DOM-based approach, this may cause problems if your document is too big to fit in memory. In XSLT 3.0 you can solve this with streaming. XSLT 3.0 also makes "identity" transformations easier to write, so the entire code now becomes:
<xsl:transform version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:mode streamable="yes" on-no-match="shallow-copy"/>
  <xsl:template match="item[. = 'b']"/>
</xsl:transform>
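To run a stylesheet like this from Java, the standard javax.xml.transform API can be used. The sketch below applies the two-rule XSLT 1.0 version, since the XSLT processor bundled with the JDK implements XSLT 1.0; the streaming 3.0 stylesheet would need an XSLT 3.0 processor with streaming support instead. The class name `XsltItemRemover` and the string-based input are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltItemRemover {

    // The identity rule plus the rule that drops <item>b</item>
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:template match=\"*\">"
          + "<xsl:copy><xsl:copy-of select=\"@*\"/><xsl:apply-templates/></xsl:copy>"
          + "</xsl:template>"
          + "<xsl:template match=\"item[. = 'b']\"/>"
          + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML and returns the result as a string.
    static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

The second template wins over the identity rule for matching elements because a pattern with a predicate has a higher default priority than `*`.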
Answer 3:
If your data does not fit into memory, you need a pull parser that does not load the file all at once. If your data fits into memory, there is a very short solution using data projection (a project that I'm affiliated with):
public class RemoveTags {
    public interface Projection {
        @XBDelete("//item[text()='b']")
        void deleteAllItems();
    }

    public static void main(String[] args) throws IOException {
        XBProjector projector = new XBProjector();
        Projection projection = projector.io().file("data.xml").read(Projection.class);
        projection.deleteAllItems();
        projector.io().file("withoutItems.xml").write(projection);
    }
}
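For the case where the data does not fit into memory, the pull-parser route could look roughly like the following, using the StAX event API (javax.xml.stream) that ships with the JDK. This is a sketch under assumptions, not a tuned implementation: the class name `StaxItemFilter` and its parameters are hypothetical, and it buffers one candidate element at a time so it can decide whether to drop it once its text is known.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

final class StaxItemFilter {

    // Copies the XML from in to out, skipping every element called elementName
    // whose text content equals unwantedText.
    static void filter(Reader in, Writer out, String elementName, String unwantedText)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                boolean candidate = event.isStartElement()
                        && event.asStartElement().getName().getLocalPart().equals(elementName);
                if (!candidate) {
                    writer.add(event); // pass everything else straight through
                    continue;
                }
                // Buffer only this one element, up to its matching end tag
                List<XMLEvent> buffered = new ArrayList<>();
                StringBuilder text = new StringBuilder();
                buffered.add(event);
                int depth = 1;
                while (depth > 0) {
                    XMLEvent inner = reader.nextEvent();
                    buffered.add(inner);
                    if (inner.isStartElement()) {
                        depth++;
                    } else if (inner.isEndElement()) {
                        depth--;
                    } else if (inner.isCharacters()) {
                        text.append(inner.asCharacters().getData());
                    }
                }
                if (text.toString().equals(unwantedText)) {
                    continue; // drop the whole buffered element
                }
                for (XMLEvent kept : buffered) {
                    writer.add(kept);
                }
            }
            writer.flush();
        } finally {
            writer.close();
            reader.close();
        }
    }
}
```

For files, the reader and writer could be a buffered FileReader and FileWriter (or streams with an explicit encoding). Memory use depends on the size of the largest candidate element rather than on the size of the document. Whitespace that surrounded a removed element is left in place, so the output may contain blank lines.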
Source: https://stackoverflow.com/questions/27978151/how-to-remove-unwanted-tags-from-xml