How to use JAXB with HTML?

血红的双手。 Posted on 2019-12-01 06:54:54

Question


I would like to unmarshal some nasty HTML to a Java object using JAXB. (I'm on Java 7.)

TagSoup is a SAX-compliant XML parser that can handle nasty HTML.

How can I set up JAXB to use TagSoup for unmarshalling HTML?

I tried setting the system property System.setProperty("org.xml.sax.driver", "org.ccil.cowan.tagsoup.Parser");

If I create an XMLReader myself, it uses TagSoup, but JAXB does not.

  1. Does com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl use DOM or SAX for parsing XML?

  2. How can I tell JAXB to use SAX?

  3. How can I tell JAXB to use TagSoup as its SAX implementation?

Following Blaise's suggestion, I tried the code below, but I get a SAXParseException on the last line. The parse succeeds when done with the XMLReader alone:

    JAXBContext jaxbContext = JAXBContext.newInstance(Thing.class);
    Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();

    XMLReader xmlReader = new org.ccil.cowan.tagsoup.Parser();

    xmlReader.parse("file:///c:/test.xml");
    System.out.println("parse ok");

    xmlReader.setContentHandler(unmarshaller.getUnmarshallerHandler());

    //SAXParseException; systemId: file:/c:/test.xml; lineNumber: 5; columnNumber: 3; The element type "br" must be terminated by the matching end-tag "</br>".
    Thing thing = (Thing) unmarshaller.unmarshal(new File("c:/test.xml"));

Answer 1:


You can get an UnmarshallerHandler from an Unmarshaller and set it as the ContentHandler on your SAX parser. After the SAX parse completes, obtain the object from the UnmarshallerHandler.

    UnmarshallerHandler unmarshallerHandler = unmarshaller.getUnmarshallerHandler();
    xmlReader.setContentHandler(unmarshallerHandler);
    xmlReader.parse(...);
    Thing thing = (Thing) unmarshallerHandler.getResult();
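
The code in the question fails because its last line calls unmarshaller.unmarshal(new File(...)). That call makes JAXB parse the file with its own default parser, so the TagSoup reader is bypassed. The handler also has to be set before xmlReader.parse(...) runs. Below is a minimal sketch of the complete flow, under some assumptions: the Thing class and its title field are hypothetical stand-ins for the asker's class, unmarshalWith is an illustrative helper name, and the JDK's own namespace-aware SAX parser is used with well-formed input so the example runs without extra libraries. With TagSoup on the classpath, passing new org.ccil.cowan.tagsoup.Parser() as the reader should be the only change. The javax.xml.bind package ships with Java 7 and 8; later JDKs need a separate JAXB dependency.

```java
import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.UnmarshallerHandler;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class HandlerDemo {

    // Hypothetical stand-in for the Thing class from the question.
    @XmlRootElement(name = "thing")
    public static class Thing {
        @XmlElement
        public String title;
    }

    // Illustrative helper: the supplied XMLReader does the parsing and
    // JAXB only consumes the SAX events it emits.
    public static <T> T unmarshalWith(XMLReader reader, InputSource input, Class<T> type)
            throws Exception {
        Unmarshaller unmarshaller = JAXBContext.newInstance(type).createUnmarshaller();
        UnmarshallerHandler handler = unmarshaller.getUnmarshallerHandler();
        reader.setContentHandler(handler); // must happen before parse()
        reader.parse(input);
        // The result is only available once the parse has finished.
        return type.cast(handler.getResult());
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the JDK parser stands in for TagSoup here. To use TagSoup,
        // replace these three lines with: new org.ccil.cowan.tagsoup.Parser()
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // JAXB relies on namespace-aware SAX events
        XMLReader reader = factory.newSAXParser().getXMLReader();

        InputSource input = new InputSource(
                new StringReader("<thing><title>Hello</title></thing>"));
        Thing thing = unmarshalWith(reader, input, Thing.class);
        System.out.println(thing.title);
    }
}
```

Note that unmarshaller.unmarshal(...) is never called with a File here. The object comes only from handler.getResult(), which is the step missing from the question's code.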

There is an example of this on my blog:

  • http://blog.bdoughan.com/2011/05/jaxb-and-dtd.html


Source: https://stackoverflow.com/questions/24791422/how-to-use-jaxb-with-html
