Keep numeric character entity characters such as `

` when parsing XML in Java

主宰稳场 提交于 2019-11-29 11:13:52

You need to escape all the XML entities before parsing the file into a Document. You do that by escaping the ampersand & itself with its corresponding XML entity &. Something like,

DocumentBuilder documentBuilder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();

String xmlContents = new String(Files.readAllBytes(Paths.get("demo.xml")), "UTF-8");

Document document = documentBuilder.parse(
         new InputSource(new StringReader(xmlContents.replaceAll("&", "&"))
        ));

Output :

2A string followed by special symbols 
  


P.S. This is complement of Ravi Thapliyal's answer, not an alternative.

I am having the same problem with handling an XML file which is exported from 2003 format Excelsheet. This XML file stores line-breaks in text contents as 
 along with other numeric character references. However, after reading it with Java DOM parser, manipulating the content of some elements and transforming it back to the XML file, I see that all the numeric character references are expanded (i.e. The line-break is converted to CRLF) in Windows with J2SE1.6. Since my goal is to keep the content format unchanged as much as possible while manipulating some elements (i.e. retain numeric character references), Ravi Thapliyal's suggestion seems to be the only working solution.

When writing the XML content back to the file, it is necessary to replace all & with &, right? To do that, I had to give a StringWriter to the transformer as StreamResult and obtain String from it, replace all and dump the string to the xml file.

TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
DOMSource source = new DOMSource(document);

//write into a stringWriter for further processing.
StringWriter stringWriter = new StringWriter();
StreamResult result = new StreamResult(stringWriter);

t.transform(source, result);

//stringWriter stream contains xml content.
String xmlContent = stringWriter.getBuffer().toString();
//revert "&" back to "&" to retain numeric character references.
xmlContent = xmlContent.replaceAll("&", "&");

BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
wr.write(xmlContent);
wr.close();
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!