I need to parse non-well-formed XML data (HTML)

Posted by 早过忘川 on 2019-12-07 05:42:25

Question


I have some non-well-formed XML (HTML) data in Java. I tried to parse it with the JAXP DOM parser, but it complains.

The question is: is there any way to use JAXP to parse such documents?

I have a file containing data such as:

<employee>
 <name value="ahmed" > <!-- note: this element is not closed, so it is not well-formed XML -->
</employee>

Answer 1:


Not really. JAXP wants well-formed markup. Have you considered the CyberNeko HTML Parser? We've been very successful with it at our shop.

EDIT: I see you want to parse XML too. Hmm... CyberNeko works well for HTML, but I don't know about other markup. It has a tag balancer that closes off some tags, but I don't know whether you can train it to recognize tags that are not HTML.
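
To see what "JAXP wants well-formed markup" means in practice, here is a minimal sketch that feeds the sample from the question to the JDK's DOM parser and reports the error. The class name WellFormedCheck and the method parseError are made up for this illustration; only standard JAXP calls are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of JAXP.
public class WellFormedCheck {

    /** Returns null if the text parses, otherwise the parser's error message. */
    public static String parseError(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // Well-formedness violations surface as SAXParseException.
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // The sample from the question: <name> is never closed.
        String broken = "<employee>\n <name value=\"ahmed\" >\n</employee>";
        // The same data with <name> closed as an empty element.
        String fixed = "<employee>\n <name value=\"ahmed\" />\n</employee>";
        System.out.println("broken: " + parseError(broken));
        System.out.println("fixed:  " + parseError(fixed));
    }
}
```

The first call should report a parse error and the second should report null. The exact wording of the message depends on the parser implementation bundled with your JDK.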




Answer 2:


You could try running your document through the JTidy API first. It can convert HTML into valid XHTML: http://jtidy.sourceforge.net/howto.html

Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parse(......)...



Answer 3:


You could use TagSoup. I have used it with great success. It is completely compatible with the Java XML APIs, including SAX, DOM, XSLT, and StAX. For example, here is how I used it to apply XSLT transforms to particularly poor HTML:

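import java.io.InputStream;
import javax.xml.transform.*;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.*;
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;

// Requires the TagSoup jar on the classpath.
public static void transform(InputStream style, InputStream data)
        throws SAXException, TransformerException {
    // TagSoup's parser is a SAX XMLReader, so it plugs into a SAXSource.
    XMLReader reader =
        XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
    Source input = new SAXSource(reader, new InputSource(data));
    Source xsl = new StreamSource(style);
    Transformer transformer =
        TransformerFactory.newInstance().newTransformer(xsl);
    transformer.transform(input, new StreamResult(System.out));
}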
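
If you need a DOM tree rather than XSLT output (as in the question), one possible approach is to run an identity transform from a SAXSource into a DOMResult. The sketch below is an illustration only: SaxToDom and toDom are made-up names, and main uses the JDK's own SAX reader as a stand-in so that it runs without extra jars. Passing a TagSoup reader created as in the method above should work the same way, but that is an assumption I have not verified here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class for illustration.
public class SaxToDom {

    /** Builds a DOM tree from whatever SAX events the given reader produces. */
    public static Document toDom(XMLReader reader, InputSource in) throws Exception {
        // A Transformer created without a stylesheet copies input to output.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        // With no target node set, the transformer creates a new Document.
        DOMResult result = new DOMResult();
        identity.transform(new SAXSource(reader, in), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader from the JDK; it still requires well-formed input.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        String xml = "<employee><name value=\"ahmed\"/></employee>";
        Document doc = toDom(reader, new InputSource(new StringReader(xml)));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

The point of the design is that the DOM-building step only sees SAX events, so it does not care whether they came from a strict XML parser or from a lenient one that repairs the markup first.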


Source: https://stackoverflow.com/questions/2560783/i-need-to-parse-non-well-formed-xml-data-html
