Parsing XML file containing HTML entities in Java without changing the XML

一个人的身影 2020-12-05 18:53

I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;, &gt; and so forth. I

6 answers

野趣味 (original poster)

2020-12-05 19:19

Try this using the Apache Commons libraries (Commons IO for IOUtils, Commons Lang 3 for the translator classes):

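import java.io.*;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.*;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.text.translate.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();

// Read the whole file into memory; the file on disk is never modified.
// UTF-8 is an assumption here; use the file's actual encoding.
String xml;
try (InputStream in = new FileInputStream(xmlfile)) {
    xml = IOUtils.toString(in, StandardCharsets.UTF_8);
}

// Translate only the HTML-specific named entities. The XML built-ins
// (&amp;, &lt;, &gt;, &quot;, &apos;) are not in these tables, so they stay escaped.
CharSequenceTranslator unescaper = new AggregateTranslator(
        new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
        new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE()));
xml = unescaper.translate(xml);

// Parse the cleaned-up string instead of the original file.
Document doc = parser.parse(new InputSource(new StringReader(xml)));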
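
If adding the Commons dependencies is not an option, the same idea can be sketched with the JDK alone. The version below is only an illustration, not a drop-in replacement: the class name `EntityPreprocessor`, its methods, and the five-entry entity table are all hypothetical, and a real table would need to cover the full HTML 4 entity set. It rewrites known HTML named entities as numeric character references, which any XML parser accepts, and leaves everything else (including `&amp;`, `&lt;` and `&gt;`) untouched.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityPreprocessor {
    // Hypothetical, deliberately tiny table (entity name -> Unicode code point).
    // The XML built-ins (amp, lt, gt, quot, apos) are intentionally absent.
    private static final Map<String, Integer> HTML_ENTITIES = Map.of(
            "mdash", 0x2014, "ndash", 0x2013, "nbsp", 0x00A0,
            "copy", 0x00A9, "eacute", 0x00E9);

    // Matches named references such as &mdash; but not numeric ones such as &#8212;
    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces known HTML entities with numeric character references;
    // unknown names are copied through unchanged.
    public static String replaceHtmlEntities(String xml) {
        Matcher m = NAMED_ENTITY.matcher(xml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = HTML_ENTITIES.get(m.group(1));
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Parses an in-memory copy of the XML after the entity rewrite.
    public static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(replaceHtmlEntities(xml))));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<note>Tom &amp; Jerry &mdash; caf&eacute;</note>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Note that this sketch works on the raw text, so it would also rewrite entity-like strings inside CDATA sections or comments, and it requires Java 9 or later (`Map.of`, `Matcher.appendReplacement(StringBuilder, ...)`).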
