How to avoid DOM parsing adding html doctype, <head> and <body> tags?

Question

<?php
    $string = '
    Some photos<br>
    <span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1697-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1695-01</span><br />    
    ';

    $dom = new DOMDocument();
    $dom->loadHTML($string);
    $dom->preserveWhiteSpace = false;
    $elements = $dom->getElementsByTagName('span');
    $spans = array();
    foreach($elements as $span) {
        $spans[] = $span;
    }
    foreach($spans as $span) {
        $span->parentNode->removeChild($span);
    }
    echo $dom->saveHTML();


?>

I'm using this code to parse strings. When the string is returned, it has some extra tags added:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Some photos<br><br><br><br><br></p></body></html>

Is there any way to avoid this and get a clean string back? The input string here is just an example; in practice it can be any HTML string.

Answer 1:

I'm actually looking for the same solution. I've been using the following method, but the <p> around the text node is still added when you call loadHTML(). I don't think there's a way around that without using another parser, unless there's some hidden flag that tells it not to do that.

This code:

<?php

function innerHTML($node){
  $doc = new DOMDocument();
  foreach ($node->childNodes as $child)
    $doc->appendChild($doc->importNode($child, true));

  return $doc->saveHTML();
}

$string = '
    Some photos<br>
    <span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1697-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1695-01</span><br />
    ';

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($string);

// Copy the live node list first, then remove each span.
$elements = $dom->getElementsByTagName('span');
$spans = array();
foreach ($elements as $span) {
    $spans[] = $span;
}
foreach ($spans as $span) {
    $span->parentNode->removeChild($span);
}

// documentElement is <html>; its first child is <body>.
echo innerHTML($dom->documentElement->firstChild);

Will output:

<p>Some photos<br><br><br><br><br></p>

Of course, this solution does not keep the markup 100% intact (the <p> wrapper is still there), but it's close.

Answer 2:

Hey, why not answer a 9-year-old question? PHP version 5.4 (released about 3 years after this question was asked) added the options parameter to DOMDocument::loadHTML(). With it you can do this:

$dom = new DomDocument();
$dom->loadHTML($string, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
// do stuff
echo $dom->saveHTML();

We pass two constants: LIBXML_HTML_NODEFDTD says not to add a document type definition, and LIBXML_HTML_NOIMPLIED says not to add implied elements like <html> and <body>.
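One caveat that is often reported with LIBXML_HTML_NOIMPLIED: the parser still wants a single root element, so a fragment with several top-level nodes may come back rearranged. A common workaround, sketched below, is to wrap the fragment in a temporary container and serialize only the container's children. The helper name removeSpans() and the wrapper <div> are illustrative assumptions, not part of the answer above; the sketch assumes PHP 5.4+ with a libxml build that supports both flags.

```php
<?php
// Hypothetical helper (not from the original answer): parse an HTML
// fragment, drop every <span>, and return the fragment without any
// doctype, <html> or <body> wrapper.
function removeSpans($fragment)
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // The temporary <div> gives the parser a single root element.
    $dom->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list before removing nodes from the tree.
    $spans = iterator_to_array($dom->getElementsByTagName('span'));
    foreach ($spans as $span) {
        $span->parentNode->removeChild($span);
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $html = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

$result = removeSpans('Some photos<br><span class="naslov_slike">photo</span><br />');
echo $result;
```

Passing a node to saveHTML() requires PHP 5.3.6 or later, which is already implied by the 5.4 requirement for the flags.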

Answer 3:

After using loadHTML, you can do this:

# loadHTML causes a !DOCTYPE tag to be added, so remove it:
$dom->removeChild($dom->firstChild);

# it also wraps the code in <html><body></body></html>, so remove that:
$dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild);

The !DOCTYPE tag will be removed, and the first tag inside the body tag will replace the html tag.

Obviously, this will only work if you're only interested in the first tag inside the body, as I was when I encountered this problem. But this example could be adapted to copy everything inside the body with a little bit of effort.
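As a rough sketch of that adaptation, the helper below serializes every child of <body> instead of only the first one. The function name bodyInnerHtml() is made up for illustration; it assumes loadHTML() has produced a <body> element, which it normally does when no parser options are passed.

```php
<?php
// Hypothetical helper: return the markup of everything inside <body>,
// leaving out the doctype and the <html>/<body> wrappers.
function bodyInnerHtml(DOMDocument $dom)
{
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // nothing was parsed into a body
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // saveHTML() accepts a node argument since PHP 5.3.6.
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<p>First</p><p>Second</p>');
$inner = bodyInnerHtml($dom);
echo $inner;
```

Unlike the replaceChild() approach, this leaves the document itself untouched and only changes what gets serialized.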

Edit: Meh, never mind. I like meder's solution.

Answer 4:

You could always just use a regex to strip that first bit out:

echo preg_replace("/<!DOCTYPE [^>]+>/", "", $dom->saveHTML());

Answer 5:

From the manual: http://php.net/manual/en/domdocument.savehtml.php

$html_fragment = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML()));

Works for me.
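One thing to watch with the str_replace() call is that it removes those tag strings wherever they appear, not only the outer wrapper. A slightly stricter variant, sketched here, uses a single anchored pattern instead. The helper name stripWrapper() is hypothetical, and the pattern assumes the output has no <head> and no attributes on <html> or <body>; if the pattern does not match, the input comes back unchanged.

```php
<?php
// Hypothetical helper: strip the doctype and the outer <html><body>
// wrapper from saveHTML() output with one anchored pattern.
function stripWrapper($html)
{
    // ^...$ anchors the match to the whole string; the "s" modifier lets
    // (.*) span newlines, and "i" ignores tag case.
    $pattern = '~^\s*<!DOCTYPE[^>]*>\s*<html>\s*<body>(.*)</body>\s*</html>\s*$~is';
    return preg_replace($pattern, '$1', $html);
}

$dom = new DOMDocument();
$dom->loadHTML('<p>Hello <b>world</b></p>');
$fragment = stripWrapper($dom->saveHTML());
echo $fragment;
```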

Answer 6:

I'm not sure if either of these will actually work, but you could try using DOMImplementation::createDocument when constructing your DOMDocument - the third argument is the DOCTYPE you wish to use.

Also, instead of saveHTML(), you could try saveXML().
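For the saveXML() idea, a minimal sketch might look like the following. When saveXML() is given a node, it serializes just that node, without an XML declaration or doctype. The trade-off is XML-style output, so empty elements such as <br> are written as <br/>. The sample input is made up for illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<p>Line one<br>line two</p>');

// Serialize each child of <body> on its own instead of the whole document.
$body = $dom->getElementsByTagName('body')->item(0);
$xml = '';
foreach ($body->childNodes as $child) {
    $xml .= $dom->saveXML($child);
}
echo $xml;
```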

Source: https://stackoverflow.com/questions/1528190/how-to-avoid-dom-parsing-adding-html-doctype-head-and-body-tags

Tags

php

parsing

dom
