How to scrape web page data without losing tags

Question

I am trying to scrape web data using PHP and DOMXPath. When I store $node->nodeValue in my database, or even just echo it, all the tags such as <p> and <br> are missing, so all the paragraphs come out concatenated. How can I solve this problem?

Answer 1:

If you have a node, and you need all its contents as they are, you can use this function:

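/**
 * Return the markup of all children of $node, tags included.
 */
function innerHTML(DOMNode $node)
{
  // Copy each child into a scratch document so it can be serialized on its own.
  $doc = new DOMDocument();
  foreach ($node->childNodes as $child) {
    // importNode() with true makes a deep copy owned by the scratch document.
    $doc->appendChild($doc->importNode($child, true));
  }
  return $doc->saveHTML();
}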
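
For comparison, here is a minimal sketch of the difference between nodeValue and serialized markup. It does not use the function above. Instead it passes each child to DOMDocument::saveHTML(), which accepts an optional node argument, as one possible alternative. The sample markup, the "content" id, and the variable names are assumptions made up for illustration.

```php
<?php
// Hypothetical sample markup standing in for a scraped page.
$html = '<div id="content"><p>First paragraph.</p><p>Second<br>paragraph.</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@id="content"]')->item(0);

// nodeValue returns only the concatenated text of the node, with no markup.
$text = $node->nodeValue;

// Serializing each child through the owning document keeps the tags.
$markup = '';
foreach ($node->childNodes as $child) {
    $markup .= $doc->saveHTML($child);
}

echo $text, "\n";
echo $markup, "\n";
```

Storing $markup rather than $text in the database should keep the paragraph and line-break tags, although they are re-serialized by the DOM and are not guaranteed to match the source byte for byte.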

Answer 2:

If you're browsing the DOM, there are most likely no tags left to see. The tags have become nodes within the DOM, and the only thing available in "string form" is the raw content those tags contained. You can, of course, use the node information to reconstruct the tags, but they won't be the original tags (e.g., you will have to choose <BR> or <br>; you won't know which one the site originally used). If you want the original tags from the get-go, keep the original stream of bytes returned by the GET/POST request you made; don't parse it into a DOM tree.
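
A small sketch of that point, assuming a hard-coded string in place of a real response body (in practice $raw would be whatever the HTTP request returned). The variable names and the sample markup are hypothetical.

```php
<?php
// Hypothetical raw response body with upper-case tags, as a site might send it.
$raw = '<html><body><P>One<BR>Two</P></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($raw);

// Element names are matched in lower case once the HTML has been parsed.
$p = $doc->getElementsByTagName('p')->item(0);

// Tags rebuilt from the DOM: the original casing is not stored in the tree.
$reserialized = $doc->saveHTML($p);

// The untouched string still holds the original tag; the DOM output does not.
var_dump(strpos($raw, '<BR>') !== false);
var_dump(strpos($reserialized, '<BR>') !== false);
```

If the exact original markup matters, one option is to keep $raw alongside whatever is extracted through the DOM.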

Source: https://stackoverflow.com/questions/5349310/how-to-scrape-web-page-data-without-losing-tags

Tags

php

dom

serialization

innerhtml
