问题
I have some non well-formed xml (HTML) data in JAVA, I used JAXP Dom, but It complains.
The Question is :Is there any way to use JAXP to parse such documents ??
I have a file containing data such as :
<employee>
<name value="ahmed" > <!-- note, this element is not closed, So it is not well-formed xml-->
</employee>
回答1:
Not really. JAXP wants well-formed markup. Have you considered the Cyberneko HTML Parser? We've been very successful with it at our shop.
EDIT: I see you are wanting to parse XML too. Hrmm.... Cyberneko works well for HTML but I don't know about others. It has a tag balancer that would close some tags off, but I don't know if you can train it to recognize tags that are not HTML.
回答2:
You could try running your document through the jtidy API first - that has the ability to convert html into valid xhtml: http://jtidy.sourceforge.net/howto.html
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parse(......)...
回答3:
You could use TagSoup. I have used it with great success. It is completely compatible with the Java XML APIs, including SAX, DOM, XSLT, and StAX. For example, here is how I used it to apply XSLT transforms to particularly poor HTML:
public static void transform(InputStream style, InputStream data)
throws SAXException, TransformerException {
XMLReader reader =
XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
Source input = new SAXSource(reader, new InputSource(data));
Source xsl = new StreamSource(style);
Transformer transformer =
TransformerFactory.newInstance().newTransformer(xsl);
transformer.transform(input, new StreamResult(System.out));
}
来源:https://stackoverflow.com/questions/2560783/i-need-to-parse-non-well-formed-xml-data-html