Parsing of badly formatted HTML in PHP

Submitted by 帅比萌擦擦* on 2019-11-28 09:20:08

One way to "fix" broken HTML is to use HTML Purifier (quoting):

HTML Purifier is a standards-compliant HTML filter library written in PHP.
HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant.


Alternatively, try loading your HTML with DOMDocument::loadHTML (quoting):

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.

And if you're trying to load HTML from a file, see DOMDocument::loadHTMLFile.
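
As a rough illustration of what the lenient parsing buys you, here is a minimal sketch that loads sloppy markup and then queries the resulting tree. The helper name extractParagraphs and the sample input are made up for this example; only the DOMDocument and libxml calls are PHP's own.

```php
<?php
// Hypothetical helper: parse possibly malformed HTML and return the
// text of every <p> element found in the repaired tree.
function extractParagraphs(string $html): array
{
    // Collect libxml warnings internally so malformed input stays quiet.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        $texts[] = trim($p->textContent);
    }
    return $texts;
}

// Sample input (an assumption): the second paragraph is never closed.
print_r(extractParagraphs('<p>one</p><p>two'));
```

For HTML stored on disk, the same function body should work with loadHTMLFile($path) in place of loadHTML($html).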

Another option is SimpleHTML.

For repairing broken HTML, you could use Tidy.
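
A minimal sketch of the Tidy route, assuming the optional tidy extension is installed. The helper name repairWithTidy and the chosen configuration values are illustrative only; tidy_repair_string() itself is part of the extension.

```php
<?php
// Hypothetical helper: repair markup with Tidy, or return null when the
// tidy extension is missing or the repair fails.
function repairWithTidy(string $html): ?string
{
    if (!extension_loaded('tidy')) {
        return null; // Tidy is not compiled into this PHP build
    }

    // Illustrative settings: emit XHTML, keep only the body content,
    // and disable line wrapping.
    $config = [
        'output-xhtml'   => true,
        'show-body-only' => true,
        'wrap'           => 0,
    ];

    $repaired = tidy_repair_string($html, $config, 'utf8');
    return $repaired === false ? null : $repaired;
}

$fixed = repairWithTidy('<p>another error</i>');
echo $fixed === null ? "tidy is not available\n" : $fixed;
```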

As an alternative, you can use the native XMLReader. Because it acts as a cursor that moves forward through the document stream and stops at each node along the way, it does not need the whole document to be valid up front: it can process the input up to the point where it becomes invalid.

See http://www.ibm.com/developerworks/library/x-pullparsingphp.html
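
The cursor model can be sketched roughly as follows. The helper name collectElementNames and the sample input are assumptions made for this example. On malformed input, read() stops where libxml gives up, and how many nodes are delivered before that point depends on the parser's internal buffering, so the sample below uses well-formed XML.

```php
<?php
// Hypothetical helper: walk the stream node by node and collect the
// names of the elements the reader manages to deliver.
function collectElementNames(string $xml): array
{
    // Keep parse errors out of the output.
    $previous = libxml_use_internal_errors(true);

    $reader = new XMLReader();
    $names = [];

    if ($reader->XML($xml)) {
        // read() advances the cursor one node at a time and returns
        // false at the end of the stream or when parsing fails.
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT) {
                $names[] = $reader->name;
            }
        }
        $reader->close();
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $names;
}

print_r(collectElementNames('<root><a>1</a><b/></root>'));
```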

Any particular reason you're still using the PHP 4 XML API?

If you can get away with using PHP 5's XML API, there are two possibilities.

First, try the built-in HTML parser. It's really not very good (it tends to choke on poorly formatted HTML), but it might do the trick. Have a look at DOMDocument::loadHTML.

Second option - you could try the HTML parser based on the HTML5 parser specification:

http://code.google.com/p/html5lib/

This tends to work better than the built-in PHP HTML parser. It loads the HTML into a DOMDocument object.

A solution is to use DOMDocument.

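Example:

<?php
// Deliberately malformed markup: a stray </div> and a <p> closed with </i>.
$str = "
<html>
 <head>
  <title>test</title>
 </head>
 <body>
  </div>error.
  <p>another error</i>
 </body>
</html>
";

// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($str);   // the parser repairs the markup while building the tree
libxml_clear_errors();  // discard the collected warnings

echo $doc->saveHTML();  // print the repaired document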

Advantage: the DOM extension is enabled by default in PHP, whereas Tidy is an optional extension that has to be installed separately.
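
If you want to know what the parser had to recover from, libxml can hand back the messages it collected while loading. This is only a sketch: the helper name htmlParseMessages is made up, and the exact wording and number of messages depend on the libxml version PHP was built against.

```php
<?php
// Hypothetical helper: parse HTML and return the messages libxml
// recorded while recovering from malformed markup.
function htmlParseMessages(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError with public line and message properties.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

$sample = "<html><body></div>error.<p>another error</i></body></html>";
foreach (htmlParseMessages($sample) as $message) {
    echo $message, PHP_EOL;
}
```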
