Is there an XmlReader equivalent for HTML in .NET?

Submitted by 妖精的绣舞 on 2019-12-07 07:54:47

Question


I've used HtmlAgilityPack in the past to parse HTML in .NET, but I don't like the fact that it only offers a DOM model.

On large documents and/or those with heavy levels of nesting, it is possible to hit stack overflow or out-of-memory exceptions. More generally, a DOM-based parsing model uses significantly more memory than a streaming approach, because it builds the whole tree even when the process consuming the HTML only needs a few elements to be available at a time.

Does anyone know of a decent HTML parser for .NET that allows you to parse HTML in a manner similar to the XmlReader class, i.e. in a forward-only, streaming manner?


Answer 1:


I usually use SgmlReader for this: https://github.com/MindTouch/SGMLReader

As others have said, HTML doesn't follow XML's well-formedness rules, so it is inherently difficult to parse, but SgmlReader usually does a pretty good job.




Answer 2:


The problem is that HTML can be malformed, and you can't know which tag is missing an end tag (or which tags are nested in the wrong order) until you have parsed a larger part of the document.

If the documents that you'll be parsing are well-formed, why don't you use XmlReader?
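
The point about malformed markup can be illustrated with a minimal sketch. It uses Python's standard-library `html.parser` purely to show the concept; it is not a .NET API, and the names `OpenTagTracker`, `find_unclosed`, and `VOID_TAGS` are hypothetical, introduced only for this example. The parser makes a single forward-only pass and keeps a stack of open tags. A missing end tag is only noticed when a later end tag for an enclosing element arrives.

```python
from html.parser import HTMLParser

# A few HTML elements that never take an end tag (illustrative, not exhaustive).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class OpenTagTracker(HTMLParser):
    """Forward-only pass that records tags found to be unclosed."""

    def __init__(self):
        super().__init__()
        self.stack = []     # tags opened but not yet closed
        self.unclosed = []  # (tag, tag-event number at which the gap was detected)
        self.events = 0     # count of start/end tag events seen so far

    def handle_starttag(self, tag, attrs):
        self.events += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        self.events += 1
        if tag not in self.stack:
            return  # stray end tag with no matching start tag; ignored here
        # Everything opened after `tag` was never closed explicitly.
        while self.stack[-1] != tag:
            self.unclosed.append((self.stack.pop(), self.events))
        self.stack.pop()


def find_unclosed(html):
    """Return the unclosed tags discovered during one streaming pass."""
    tracker = OpenTagTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.unclosed


if __name__ == "__main__":
    print(find_unclosed("<div><p>one<p>two</div>"))
```

For the input `<div><p>one<p>two</div>`, neither `<p>` can be flagged as unclosed when it is read; both are reported only once `</div>` arrives as the fourth tag event. A streaming HTML reader has to either buffer until such decisions can be made or apply recovery rules as it goes.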



Source: https://stackoverflow.com/questions/6452433/is-there-an-xmlreader-equivalent-for-html-in-net
