How to parse XHTML ignoring the DOCTYPE declaration using a DOM parser

Submitted by 女生的网名这么多〃 on 2019-12-10 15:39:30

Question


I am having trouble parsing XHTML that contains a DOCTYPE declaration using a DOM parser.

Error: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd%20

Declaration: DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

Is there a way to parse the XHTML into a Document object while ignoring the DOCTYPE declaration?


Answer 1:


A solution that works for me is to give the DocumentBuilder a fake EntityResolver that returns an empty stream. There's a good explanation here (look at the last message from kdgregory):

http://forums.sun.com/thread.jspa?threadID=5362097

Here's kdgregory's solution:

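// Requires java.io.IOException, java.io.StringReader, org.xml.sax.EntityResolver,
// org.xml.sax.InputSource and org.xml.sax.SAXException.
documentBuilder.setEntityResolver(new EntityResolver() {
    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        // Answer every external-entity request (including the DTD) with empty
        // content, so the parser never fetches anything over the network.
        return new InputSource(new StringReader(""));
    }
});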
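For context, here is a minimal self-contained sketch of the same idea. The class name EmptyResolverExample and the helper parseIgnoringDtd are hypothetical names chosen for illustration, and the resolver is written as a lambda (Java 8+) rather than the anonymous class above. It assumes the JDK's default DocumentBuilderFactory implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class EmptyResolverExample {

    // Hypothetical helper: parses XHTML text while answering every
    // external-entity request (including the DTD) with empty content.
    static Document parseIgnoringDtd(String xhtml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html><head><title>t</title></head><body/></html>";
        Document doc = parseIgnoringDtd(xhtml);
        // Prints the name of the root element.
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Because the DTD is replaced with empty content, documents that rely on entities declared in the DTD (for example &nbsp;) will likely fail to parse with this approach.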



Answer 2:


The parser is required to download the DTD, but you may be able to get around it by setting the standalone attribute on the <?xml ... ?> line.

Note, however, that this particular error is most likely triggered by a confusion between XML Schema definitions and DTD URLs. See http://www.w3schools.com/xhtml/xhtml_dtd.asp for details. The right one is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">



Answer 3:


The easiest thing to do is to call setValidating(false) on your DocumentBuilderFactory. If you want to do validation, download the DTD and use a local copy. As commented by Rachel above, this is discussed by the World Wide Web Consortium (W3C).

In short, because the default DocumentBuilderFactory downloads the DTD every time it validates, the W3C servers were getting hit every time a typical programmer tried to parse an XHTML file in Java. They can't afford that much traffic, so they respond with an error.
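One way to use a local copy of the DTD is an EntityResolver that maps the DOCTYPE's public identifier to a file on disk. The sketch below is an illustration under assumptions: the class name LocalDtdResolver, its parse helper, and the map of public IDs to paths are hypothetical, and you would need to download the DTD (and any entity files it references) yourself.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

final class LocalDtdResolver implements EntityResolver {

    // Maps a public identifier, e.g. "-//W3C//DTD XHTML 1.0 Transitional//EN",
    // to the path of a locally stored copy of that DTD.
    private final Map<String, Path> localCopies;

    LocalDtdResolver(Map<String, Path> localCopies) {
        this.localCopies = localCopies;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        Path local = (publicId == null) ? null : localCopies.get(publicId);
        if (local == null) {
            // Unknown entity: returning null lets the parser resolve it the default way.
            return null;
        }
        InputSource source = new InputSource(Files.newInputStream(local));
        source.setPublicId(publicId);
        // Using the local file URI as the system ID lets relative references
        // inside the DTD resolve against the local directory.
        source.setSystemId(local.toUri().toString());
        return source;
    }

    // Hypothetical helper: parses XHTML text using the local DTD copies.
    static Document parse(String xhtml, Map<String, Path> localCopies) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(new LocalDtdResolver(localCopies));
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }
}
```

Unlike an empty resolver, this keeps the entity declarations from the DTD available to the parser, at the cost of shipping the DTD files with your application.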




Answer 4:


Instead of the fake resolver, the following code snippet instructs the parser to really ignore the external DTD from the DOCTYPE declaration:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

(...)

DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
f.setValidating(false);
f.setAttribute("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
DocumentBuilder builder = f.newDocumentBuilder();
Document document = builder.parse( ... );
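A self-contained variant of the snippet above might look like the following sketch. It swaps in setFeature(String, boolean) in place of setAttribute, which is an assumption on my part rather than the answer's code; the feature URI is Xerces-specific, so this relies on the JDK's bundled Xerces-based parser (or Apache Xerces) being the active implementation. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class SkipExternalDtdExample {

    // Xerces-specific feature URI, quoted from the answer above.
    private static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Hypothetical helper: parses XHTML text without loading the external DTD.
    static Document parseWithoutExternalDtd(String xhtml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        // Throws ParserConfigurationException if the parser does not know this feature.
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html><head><title>t</title></head><body/></html>";
        Document doc = parseWithoutExternalDtd(xhtml);
        // Prints the name of the root element.
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```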


Source: https://stackoverflow.com/questions/2640825/how-to-parse-a-xhtml-ignoring-the-doctype-declaration-using-dom-parser
