PHP SAX parser for HTML?

↘锁芯ラ submitted on 2019-12-01 21:11:06

SAX was designed to process well-formed XML and to fail on malformed markup. Processing invalid HTML markup requires keeping more state than SAX parsers typically keep.

I'm not aware of any SAX-like parser for HTML. Your best shot is to pass the HTML through Tidy first and then use an XML parser, but this may defeat your purpose of using a SAX parser in the first place.

Try to use HTML SAX Parser

Peter Krauss

Summarizing as two steps:

  1. Use Tidy to transform "free HTML" into "good XHTML".
  2. Use XML Parser to parse XHTML as XML by SAX API.

First use Tidy (!) to transform "free HTML" into XHTML (also when you cannot trust your "supposed XHTML"). See the cleanRepair method. It takes more time, but it works with big files (!). If the file is very large, set the maximum execution time to a few minutes.

Another option (for working with big files) is to cache your XHTML files once they have been checked or transformed into XHTML. See Tidy's repairFile method.
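A minimal sketch of the Tidy step, assuming the tidy extension is installed. The function names `htmlToXhtml` and `cachedXhtmlFile`, the chosen Tidy options, and the cache-by-modification-time rule are illustrative assumptions, not something the answer prescribes.

```php
<?php
// Tidy options used by both helpers (an assumed, minimal set).
const XHTML_TIDY_CONFIG = [
    'output-xhtml'     => true, // emit XHTML instead of HTML
    'numeric-entities' => true, // write &nbsp; as &#160; so XML parsers accept it
    'wrap'             => 0,    // do not wrap long lines
];

// Hypothetical helper: repair an HTML string and return it as XHTML.
function htmlToXhtml(string $html): string
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The tidy extension is not available.');
    }
    $tidy = new tidy();
    $tidy->parseString($html, XHTML_TIDY_CONFIG, 'utf8');
    $tidy->cleanRepair();
    return tidy_get_output($tidy);
}

// Hypothetical helper: repair a file once and reuse the cached XHTML
// until the source file changes. Returns the path of the cached file.
function cachedXhtmlFile(string $htmlFile, string $cacheFile): string
{
    if (!is_file($cacheFile) || filemtime($cacheFile) < filemtime($htmlFile)) {
        $xhtml = tidy_repair_file($htmlFile, XHTML_TIDY_CONFIG, 'utf8');
        if ($xhtml === false) {
            throw new RuntimeException("Tidy could not repair $htmlFile");
        }
        file_put_contents($cacheFile, $xhtml);
    }
    return $cacheFile;
}
```

For very large inputs, calling `set_time_limit()` before the repair is one way to raise the maximum execution time mentioned above.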

With a "trusted XHTML", use SAX... How to use SAX with PHP?

Parse the XML with a standard SAX API. In PHP this is implemented by LibXML (see LibXML2 at xmlsoft.org), and its interface is PHP's XML Parser, which is close to the standard SAX API.
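As a rough illustration of the event-based style of PHP's XML Parser, the sketch below records start, end, and text events from an XHTML string. The function name `collectSaxEvents` and the shape of the event arrays are assumptions made for this example.

```php
<?php
// Hypothetical helper: run the XML Parser over an XHTML string and
// return the list of events it reported.
function collectSaxEvents(string $xhtml): array
{
    $events = [];
    $parser = xml_parser_create('UTF-8');
    // Keep tag names as written; by default the parser upper-cases them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Called for every opening tag.
        function ($p, $name, $attrs) use (&$events) {
            $events[] = ['start', $name, $attrs];
        },
        // Called for every closing tag.
        function ($p, $name) use (&$events) {
            $events[] = ['end', $name];
        }
    );
    xml_set_character_data_handler(
        $parser,
        // Called for text nodes; whitespace-only text is skipped here.
        function ($p, $text) use (&$events) {
            if (trim($text) !== '') {
                $events[] = ['text', $text];
            }
        }
    );

    // The third argument marks this as the final chunk of input.
    if (!xml_parse($parser, $xhtml, true)) {
        $message = xml_error_string(xml_get_error_code($parser));
        $line = xml_get_current_line_number($parser);
        throw new RuntimeException("XML error: $message at line $line");
    }
    return $events;
}
```

For big files, the same handlers can be fed piece by piece: read the file in chunks with `fread()` and call `xml_parse($parser, $chunk, feof($handle))` for each chunk, so the whole document never has to be in memory.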

Another way to use the "SAX of LibXML2" through a different interface (a PHP iterator instead of the traditional SAX interface) is XMLReader. See this explanation about "XMLReader use SAX".
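A short sketch of the pull-style alternative with XMLReader, which walks the document node by node instead of firing callbacks. The function name `extractLinks` and the choice to collect `href` attributes are assumptions made for illustration.

```php
<?php
// Hypothetical helper: collect the href of every <a> element in an XHTML string.
function extractLinks(string $xhtml): array
{
    $reader = new XMLReader();
    // Load the string into this reader instance.
    if (!$reader->XML($xhtml)) {
        throw new RuntimeException('Could not load the XHTML string.');
    }

    $links = [];
    // read() advances to the next node and returns false at the end of input.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'a') {
            $href = $reader->getAttribute('href');
            if ($href !== null) {
                $links[] = $href;
            }
        }
    }
    $reader->close();
    return $links;
}
```

To stream a big file rather than a string, `$reader->open($path)` can replace the `XML()` call; the loop stays the same.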


Yes, the terms "SAX" and "SAX API" do not appear in the PHP manual (!!). See this old but good introduction.

I would suggest the PEAR package here: http://pear.php.net/package/XML_HTMLSax/redirected
