In my code I convert some styled xls document to html using openoffice.
I then parse the tables using xml_parser_create
.
The problem is that openoffice creates oldschool html with unclosed <BR>
and <HR>
tags, it doesn't create doctypes and don't quote attributes <TABLE WIDTH=4>
.
The php parsers I know off don't like this, and yield xml formatting errors. My current solution is to run some regexes over the file before I parse it, but this is neither nice nor fast.
Do you know a (hopefully included) php-parser, that doesn't care about these kinds of mistakes? Or perhaps a fast way to fix a 'broken' html?
A solution to "fix" broken HTML could be to use HTMLPurifier (quoting) :
HTML Purifier is a standards-compliant HTML filter library written in PHP.
HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant
An alternative idea might be to try loading your HTML with DOMDocument::loadHTML
(quoting) :
The function parses the HTML contained in the string source . Unlike loading XML, HTML does not have to be well-formed to load.
And if you're trying to load HTML from a file, see DOMDocument::loadHTMLFile
.
There is SimpleHTML
For repairing broken HTML, you could use Tidy.
As an alternative you can use the native XML Reader. Because it is acts as a cursor going forward on the document stream and stopping at each node on the way, it will not break on invalid XML documents.
See http://www.ibm.com/developerworks/library/x-pullparsingphp.html
Any particular reason you're still using the PHP 4 XML API?
If you can get away with using PHP 5's XML API, there are two possibilities.
First, try the built-in HTML parser. It's really not very good (it tends to choke on poorly formatted HTML), but it might do the trick. Have a look at DomDocument::LoadHTML.
Second option - you could try the HTML parser based on the HTML5 parser specification:
http://code.google.com/p/html5lib/
This tends to work better than the built-in PHP HTML parser. It loads the HTML into a DomDocument object.
A solution is to use DOMDocument.
Example :
$str = "
<html>
<head>
<title>test</title>
</head>
<body>
</div>error.
<p>another error</i>
</body>
</html>
";
$doc = new DOMDocument();
@$doc->loadHTML($str);
echo $doc->saveHTML();
Advantage : natively included in PHP, contrary to PHP Tidy.
来源:https://stackoverflow.com/questions/2351526/parsing-of-badly-formatted-html-in-php