Parsing XML file containing HTML entities in Java without changing the XML

一个人的身影 2020-12-05 18:53

I have to parse a bunch of XML files in Java that sometimes, and invalidly, contain HTML entities such as `&gt;` and so forth. I ...
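
For reference, the failure is easy to reproduce with the JDK parser alone. This is only a minimal sketch; the class name `UndeclaredEntityDemo` and the helper `parses` are made up for illustration. The entities that XML itself predefines (such as `&gt;`) parse fine, while an HTML-only entity such as `&nbsp;` is rejected as undeclared.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class UndeclaredEntityDemo {
    /** Returns true if the JDK's default DOM parser accepts the given XML text. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // An undeclared entity reference is a fatal well-formedness error.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<p>a &gt; b</p>"));   // predefined XML entity
        System.out.println(parses("<p>a&nbsp;b</p>"));   // HTML-only entity, no DTD
    }
}
```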

6 answers
  •  野趣味 (OP)
    2020-12-05 19:19

    Try this, using Apache Commons IO (`IOUtils`) and Apache Commons Lang 3 (`org.apache.commons.lang3.text.translate`):

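    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.StringReader;
    import java.nio.charset.StandardCharsets;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.apache.commons.io.IOUtils;
    import org.apache.commons.lang3.text.translate.*;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();

    // Read the whole file as text (assumes UTF-8; adjust to the files' real encoding).
    String xml;
    try (InputStream in = new FileInputStream(xmlfile)) {
        xml = IOUtils.toString(in, StandardCharsets.UTF_8);
    }

    // Unescape only the HTML-specific entities. BASIC_UNESCAPE is left out on purpose,
    // so &lt;, &gt;, &amp; and &quot; stay escaped and the markup remains well-formed.
    CharSequenceTranslator htmlOnly = new AggregateTranslator(
            new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
            new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE()));

    // Parse the translated text; the files on disk are never modified.
    Document doc = parser.parse(new InputSource(new StringReader(htmlOnly.translate(xml))));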
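
If adding the Commons libraries is not an option, the same preprocess-then-parse idea can be sketched with the JDK alone (Java 9+). This is an illustration under stated assumptions, not a drop-in replacement: the class `HtmlEntityPreprocessor`, its methods, and the four-entry entity table are hypothetical, and a real table would need every entity the files actually use. Here each known HTML entity is rewritten as a numeric character reference, which XML accepts without a declaration; unknown names and the predefined XML entities are left untouched.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class HtmlEntityPreprocessor {
    // Hypothetical, deliberately tiny table: entity name -> Unicode code point.
    private static final Map<String, Integer> HTML_ENTITIES = Map.of(
            "nbsp", 0x00A0,
            "mdash", 0x2014,
            "eacute", 0x00E9,
            "copy", 0x00A9);

    private static final Pattern NAMED_ENTITY =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Rewrites known HTML entities as numeric references; everything else is kept. */
    static String toNumericReferences(String xml) {
        Matcher m = NAMED_ENTITY.matcher(xml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = HTML_ENTITIES.get(m.group(1));
            // Names not in the table (including amp, lt, gt) pass through unchanged.
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Parses XML text after rewriting the HTML entities in memory. */
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(toNumericReferences(xml))));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<p>caf&eacute;&nbsp;&amp;&nbsp;bar</p>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Like the translator-based version, this simple text pass does not know about CDATA sections or comments, so an entity-like string inside them would be rewritten as well.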
    
