Parsing of badly formatted HTML in PHP
Question: In my code I convert a styled XLS document to HTML using OpenOffice, then parse the tables with xml_parser_create. The problem is that OpenOffice produces old-school HTML: it leaves <BR> and <HR> tags unclosed, emits no doctype, and doesn't quote attributes (for example <TABLE WIDTH=4>). The PHP parsers I know of don't like this and report XML formatting errors. My current solution is to run some regexes over the file before parsing it, but this is neither nice nor fast. Do you know a (hopefully