Error Tolerant HTML/XML/SGML parsing in PHP

难免孤独 2020-12-06 13:48

I have a bunch of legacy documents that are HTML-like. That is, they look like HTML but contain additional made-up tags that aren't part of HTML.



        
6 Answers
  •  青春惊慌失措
    2020-12-06 14:51

    @Alan Storm

    Your comment on my other answer got me to thinking:

    When you load an HTML file with DOMDocument, it appears to do some level of cleanup re: well-formedness, BUT requires all your tags to be legit HTML tags. I'm looking for something that does the former, but not the latter. (Alan Storm)

    Run a regex (sorry!) over the tags, and when it finds one that isn't a valid HTML element, replace it with a valid element that you know doesn't appear in any of the documents (blink comes to mind...), and give it an attribute whose value is the name of the illegal element, so that you can switch it back afterwards. E.g.:

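    // "mytag" and the "data-orig" attribute are placeholder names chosen for illustration
    $code = str_replace('<mytag>', '<blink data-orig="mytag">', $code);
    // and then back again...
    $code = preg_replace('/<blink data-orig="([^"]+)">/', '<\1>', $code);
    // closing tags (</mytag>) would need the same treatment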
    

    Obviously that code won't work as-is, but you get the general idea?
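
    The idea above can be sketched a little more fully. This is only a rough illustration under several assumptions: the whitelist in KNOWN_TAGS, the data-orig attribute, and the function names maskUnknownTags and unmaskTags are all made up for this example, and the regexes ignore self-closing tags, comments, and ">" characters inside attribute values. It also assumes, as suggested above, that no real blink elements occur in the documents.

```php
<?php
// Hypothetical whitelist: extend it to cover every real HTML element you use.
const KNOWN_TAGS = ['html', 'head', 'body', 'div', 'p', 'span', 'a', 'b', 'i', 'ul', 'li'];

// Replace every tag that is not in KNOWN_TAGS with a <blink> placeholder,
// remembering the original name in a data-orig attribute.
function maskUnknownTags(string $code): string
{
    return preg_replace_callback(
        '/<(\/?)([A-Za-z][A-Za-z0-9:_-]*)((?:\s[^>]*)?)>/',
        function (array $m): string {
            [$whole, $slash, $name, $rest] = $m;
            if (in_array(strtolower($name), KNOWN_TAGS, true)) {
                return $whole; // real HTML element: leave untouched
            }
            if ($slash === '/') {
                return '</blink>'; // the name is recovered later from the open tag
            }
            return '<blink data-orig="' . $name . '"' . $rest . '>';
        },
        $code
    );
}

// Turn the placeholders back into the original tags. A stack pairs each
// </blink> with the most recent unclosed placeholder, so nesting survives.
function unmaskTags(string $code): string
{
    $stack = [];
    return preg_replace_callback(
        '/<blink data-orig="([^"]+)"([^>]*)>|<\/blink>/',
        function (array $m) use (&$stack): string {
            if ($m[0] === '</blink>') {
                $name = array_pop($stack);
                return $name === null ? $m[0] : '</' . $name . '>';
            }
            $stack[] = $m[1];
            return '<' . $m[1] . $m[2] . '>';
        },
        $code
    );
}

// Usage: mask, run whatever HTML-only cleanup you need, then unmask.
$legacy   = '<p>Hello <widget id="w1">world <gadget>!</gadget></widget></p>';
$masked   = maskUnknownTags($legacy);
$restored = unmaskTags($masked);
```

    The stack-based restore relies on the placeholders being properly nested, which should hold once the masked markup has been through a cleanup step that balances tags.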
