php extract body tag content

Question

I'm trying to do something that should be very easy, but I can't get it to work, which makes me wonder whether I'm using the right workflow.

I have a simple HTML page which I load in my desktop application as a help file. This page has no menu, just the content. On my website I want to have a more sophisticated help system, so I want to use a PHP file which will show a menu, breadcrumbs, a header and a footer. To avoid duplicating my help content, I want to load the original HTML help file and add its body content to my enhanced help page.

I'm using this code to extract the title:

function getURLContent($filename){
    $url = realpath(dirname(__FILE__)) . DIRECTORY_SEPARATOR . $filename;
    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = FALSE;
    @$doc->loadHTMLFile($url);
    return $doc;
}

function getSingleElementValue($element){
  if (!is_null($element)) {
    $node = $element->childNodes->item(0);
    return $node->nodeValue;
  }
} 

$doc = getURLContent("test.html");
$title = getSingleElementValue($doc->getElementsByTagName('title')->item(0));
echo $title;

The title is correctly extracted.

Now I try to extract the body:

function getBodyContent($element){
  $mock = new DOMDocument;
  foreach ($element->childNodes as $child){
      $mock->appendChild($mock->importNode($child, true));
  }        
  return $mock->saveHTML();
}

$body = getBodyContent($doc->getElementsByTagName('body')->item(0));
echo $body;

The getBodyContent() function is one of the several options I tried. All of them return the whole HTML tag, including the HEAD tag.

My question is: Is this a correct workflow or should I use something else?

Thanks.

Update: My final goal is to have a website with multiple pages that has the help files accessible via a menu. These pages will be generated using something like generate.php?page=test.html. I'm not yet at this part. The goal is also to not duplicate the content of test.html because this file will be used in my desktop application (using a web control). In my desktop application I don't need the menu and such.

Update #2: I had to add <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> to the HTML file I want to read, and now I do get the body content. Unfortunately, all tags are stripped. I'll need to fix that as well.

Answer 1:

The problem is that saveHTML(), called without an argument, serializes the whole document. You don't want that; you want only the nodes you put in.

Thankfully, you can do this much more easily.

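function getBodyContent(DOMNode $element) {
    $doc = $element->ownerDocument;
    $wrapper = $doc->createElement('div');
    // Move the children one at a time. childNodes is a live list, so a
    // foreach over it skips nodes once they start being moved.
    while ($element->firstChild) {
        $wrapper->appendChild($element->firstChild);
    }
    $element->appendChild($wrapper);
    // Serialize only the wrapper, then strip its known opening and closing tags.
    $html = $doc->saveHTML($wrapper);
    return substr($html, strlen("<div>"), -strlen("</div>"));
}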

This wraps the contents into a single element of known tag representation (the body may have attributes that make it unknown), gets the rendered HTML from that element, and strips off the known tag of the wrapper.
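As an alternative to the wrapper, here is a minimal sketch that serializes each child of the body separately and concatenates the results, so the document is not modified at all. The function name getInnerHtml and the sample markup are made up for illustration; passing a node to saveHTML() needs PHP 5.3.6 or later, and the return type declaration assumes PHP 7.

```php
<?php
// Hypothetical helper, not part of the original answer: returns the
// serialized children of $element without touching the document.
function getInnerHtml(DOMNode $element): string {
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML() with a node argument serializes only that node.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument;
// Suppress the warnings libxml may raise for imperfect markup.
@$doc->loadHTML('<html><head><title>Help</title></head>'
    . '<body class="x"><h1>Hi</h1><p>Some <b>bold</b> text.</p></body></html>');

$body = $doc->getElementsByTagName('body')->item(0);
echo getInnerHtml($body);
```

The foreach is safe here because nothing is moved or removed while iterating. The serializer may insert line breaks between block-level elements, so the output should not be compared byte for byte against the input.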

I'd also like to suggest an improvement to getSingleElementValue:

function getSingleElementValue(DOMNode $element) {
    return trim($element->textContent);
}

Note also the use of type hints to ensure that your functions are indeed getting the kind of thing that is expected - this is useful as it means we no longer need to check "does $element->ownerDocument exist? does $element->ownerDocument->saveHTML() do what we think it does?" and other such questions. It ensures we have a DOMNode, so we know it has those things.
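One caveat with the type hint: getElementsByTagName(...)->item(0) returns null when the page has no such element, and passing null to a parameter declared as DOMNode raises an error. If a missing element is a normal case for you, a nullable hint is one option. This is only a sketch, assuming PHP 7.1 or later; the function name and the default value are made up for illustration.

```php
<?php
// Hypothetical variant of getSingleElementValue that tolerates a
// missing element instead of failing on the type hint.
function getSingleElementValueOrDefault(?DOMNode $element, string $default = ''): string {
    return $element === null ? $default : trim($element->textContent);
}

$doc = new DOMDocument;
// Suppress the warnings libxml may raise for imperfect markup.
@$doc->loadHTML('<html><head><title>  Help page </title></head>'
    . '<body><p>Hi</p></body></html>');

// The title exists, so its trimmed text is returned.
$title = getSingleElementValueOrDefault($doc->getElementsByTagName('title')->item(0));
// There is no <h1>, so item(0) is null and the default is returned.
$heading = getSingleElementValueOrDefault($doc->getElementsByTagName('h1')->item(0), '(none)');

echo $title, "\n", $heading, "\n";
```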

Source: https://stackoverflow.com/questions/34436375/php-extract-body-tag-content

Tags

php

html

html-parsing