I have a bunch of legacy documents that are HTML-like. As in, they look like HTML, but have additional made-up tags that aren't a part of HTML.
I wonder if passing the "bad" HTML through HTML Tidy might help as a first pass? It might be worth a look: if you can get the document to be well-formed, maybe you could then load it as a regular XML file with DOMDocument.
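
For what it's worth, here is a rough sketch of that idea in PHP. It assumes the `tidy` and `dom` extensions are installed, and the tag names `note` and `sidebar` are hypothetical stand-ins for whatever made-up tags your documents use. Tidy generally needs to be told about custom tags (via `new-blocklevel-tags` or `new-inline-tags`), otherwise it tends to treat them as errors. I haven't run this against real legacy files, so treat it as a starting point rather than a finished solution.

```php
<?php
// Sketch only: assumes PHP's tidy and dom extensions are available.
// <note> and <sidebar> are hypothetical examples of made-up tags.
$legacy = '<html><body><note>Unclosed <b>bold text</note>'
        . '<sidebar>More text</sidebar></body></html>';

$config = [
    'output-xml'          => true,            // emit XML instead of HTML
    'numeric-entities'    => true,            // avoid named entities like &nbsp; that XML does not define
    'new-blocklevel-tags' => 'note, sidebar', // declare the custom tags so Tidy accepts them
    'wrap'                => 0,               // do not re-wrap long lines
];

// First pass: let Tidy repair the markup.
$xml = tidy_repair_string($legacy, $config, 'utf8');
if ($xml === false || $xml === '') {
    throw new RuntimeException('Tidy could not repair the document.');
}

// Second pass: load the repaired markup as ordinary XML.
$doc = new DOMDocument();
if (!$doc->loadXML($xml)) {
    throw new RuntimeException('Tidy output was still not well-formed XML.');
}

// The custom tags can now be queried like any other element.
foreach ($doc->getElementsByTagName('note') as $note) {
    echo trim($note->textContent), PHP_EOL;
}
```

If your custom tags behave more like inline elements (for example, they appear inside paragraphs), listing them under `new-inline-tags` instead is probably the better fit.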