php extract body tag content

Submitted by 删除回忆录丶 on 2021-02-08 03:33:25

Question


I'm trying to do something that should be very easy, but I can't get it to work, which makes me wonder whether I'm using the right workflow.

I have a simple HTML page which I load in my desktop application as a help file. This page has no menu, just the content. On my website I want to have a more sophisticated help system, so I want to use a PHP file which will show a menu, breadcrumbs, a header and a footer. To avoid duplicating my help content, I want to load the original HTML help file and add its body content to my enhanced help page.

I'm using this code to extract the title:

function getURLContent($filename){
    $url = realpath(dirname(__FILE__)) . DIRECTORY_SEPARATOR . $filename;
    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = FALSE;
    @$doc->loadHTMLFile($url);
    return $doc;
}

function getSingleElementValue($element){
  if (!is_null($element)) {
    $node = $element->childNodes->item(0);
    return $node->nodeValue;
  }
} 

$doc = getURLContent("test.html");
$title = getSingleElementValue($doc->getElementsByTagName('title')->item(0));
echo $title;

The title is correctly extracted.

Now I try to extract the body:

function getBodyContent($element){
  $mock = new DOMDocument;
  foreach ($element->childNodes as $child){
      $mock->appendChild($mock->importNode($child, true));
  }        
  return $mock->saveHTML();
}

$body = getBodyContent($doc->getElementsByTagName('body')->item(0));
echo $body;

The getBodyContent() function is one of the several options I tried. All of them return the whole HTML tag, including the HEAD tag.

My question is: Is this a correct workflow or should I use something else?

Thanks.

Update: My final goal is to have a website with multiple pages that has the help files accessible via a menu. These pages will be generated using something like generate.php?page=test.html. I'm not yet at this part. The goal is also to not duplicate the content of test.html because this file will be used in my desktop application (using a web control). In my desktop application I don't need the menu and such.

Update #2: I had to add <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> to the HTML file I want to read, and now I do get the body content. Unfortunately, all tags are stripped. I'll need to fix that as well.


Answer 1:


The problem is that saveHTML(), when called without an argument, serializes a complete document. You don't want that; you want only the content you put in.

Thankfully, you can do this much more easily.

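function getBodyContent(DOMNode $element) {
    $doc = $element->ownerDocument;
    $wrapper = $doc->createElement('div');
    // appendChild() moves the node, so take the first child repeatedly
    // rather than iterating the live childNodes list while it changes.
    while ($element->firstChild !== null) {
        $wrapper->appendChild($element->firstChild);
    }
    $element->appendChild($wrapper);
    // Passing a node to saveHTML() serializes only that node (PHP 5.3.6+).
    $html = $doc->saveHTML($wrapper);
    return substr($html, strlen("<div>"), -strlen("</div>"));
}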

This wraps the contents into a single element of known tag representation (the body may have attributes that make it unknown), gets the rendered HTML from that element, and strips off the known tag of the wrapper.
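If you would rather not modify the loaded document at all, a variant is to serialize each child of the body separately and concatenate the results. This is only a minimal sketch, not the approach described above: the function name getInnerHtml and the inline sample markup are made up for illustration, and it assumes PHP 5.3.6 or later, where saveHTML() accepts a node argument.

```php
<?php
// Hypothetical helper (not from the original answer): returns the markup
// inside $element without moving or wrapping any nodes.
function getInnerHtml(DOMNode $element): string {
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes just that node and its descendants.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Sample markup stands in for the real help file.
$sample = '<html><head><title>Help</title></head>'
        . '<body class="help"><h1>Topic</h1><p>Some <b>bold</b> text.</p></body></html>';

$doc = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($sample);
libxml_clear_errors();

$bodyHtml = getInnerHtml($doc->getElementsByTagName('body')->item(0));
echo $bodyHtml, "\n";
```

The serializer may insert line breaks between block-level elements, so don't rely on the exact whitespace of the result. The body's own tag and attributes, and everything in the head, are left out.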

I'd also like to suggest an improvement to getSingleElementValue:

function getSingleElementValue(DOMNode $element) {
    return trim($element->textContent);
}

Note also the use of type hints to ensure that your functions are indeed getting the kind of thing that is expected - this is useful as it means we no longer need to check "does $element->ownerDocument exist? does $element->ownerDocument->saveHTML() do what we think it does?" and other such questions. It ensures we have a DOMNode, so we know it has those things.
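One caveat with the type hint: item(0) returns null when the document has no matching element, and passing null to a DOMNode parameter raises a TypeError. The sketch below is one possible way to handle that, assuming PHP 7.1 or later for nullable types. The name getSingleElementValueOrDefault and the $default parameter are illustrative and not part of the code above. It also shows that textContent returns text only, so any markup inside the element is dropped.

```php
<?php
// Hypothetical variant: accepts null so a missing element yields a fallback
// value instead of a TypeError.
function getSingleElementValueOrDefault(?DOMNode $element, string $default = ''): string {
    return $element === null ? $default : trim($element->textContent);
}

// Sample markup stands in for the real help file.
$sample = '<html><head><title> Help page </title></head>'
        . '<body><p>Some <b>bold</b> text.</p></body></html>';

$doc = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($sample);
libxml_clear_errors();

$title    = getSingleElementValueOrDefault($doc->getElementsByTagName('title')->item(0));
$missing  = getSingleElementValueOrDefault($doc->getElementsByTagName('h1')->item(0), 'n/a');
$bodyText = getSingleElementValueOrDefault($doc->getElementsByTagName('body')->item(0));

echo $title, "\n";    // trimmed text of <title>
echo $missing, "\n";  // the fallback, because the sample has no <h1>
echo $bodyText, "\n"; // text only: the <p> and <b> tags are not included
```

Whether a silent fallback or a thrown error is the better behaviour depends on how strictly you want to validate your help files.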



Source: https://stackoverflow.com/questions/34436375/php-extract-body-tag-content
