PHP Simple HTML Dom Parser Memory Leak / Usage

Posted by 五迷三道 on 2019-12-30 03:43:06

Question


I'm trying to use PHP Simple HTML Dom Parser to parse some information from some sites; which sites and what data doesn't matter. It seems to have a HUGE memory problem, though. I managed to cut the HTML down to only 6 kB, but a script that finds a few elements and saves them to a database takes as much as 700 MB of RAM and over 1 GB of virtual memory! I read somewhere that I should use ->clear() to free up memory, but that doesn't seem to help.

I call str_get_html() once and ->find() five times, assigning each result to a variable.

$main_html = str_get_html($main_site);
$x = $main_html->find(...);
$y = $main_html->find(...);

etc.

I tried calling, for example, $y->clear() once I was done with $y, but I get the error PHP Fatal error: Call to a member function clear() on a non-object, even though $y exists and if($y) is true. Even foreach($y) echo $y->plaintext returns the plaintext of $y.

From htop:

PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
8839 username    20   0 1068M  638M   268 R 23.0  8.0  0:08.41 php myscript.php

What is wrong?

Simple test:

echo "(MEM:".memory_get_usage()."->";
$product = $p->find('a',0)->href;
echo memory_get_usage()."->";
unset($product);
$p->clear();
unset($p);
echo memory_get_usage().")";

The result is:

(MEM:11865648->11866192->11865936)

More readable form:

11865648->
11866192-> (+544 in total)
11865936 (+288 in total)

Of course I can't use $product->clear(), because that fails with PHP Fatal error: Call to a member function clear() on a non-object.


Answer 1:


There seem to be memory problems when str_get_html(), or a similar function that creates a simple_html_dom object, is called several times without clearing and destroying the previous object, especially when using ->find(), which creates an array of simple_html_dom_node objects. Even the FAQ on the author's site says to clear and destroy the previous simple_html_dom object before creating a new one, but sometimes that can't be done without additional code and memory.

That's why I created this function, to remove all PHP Simple HTML Dom Parser traces from memory:

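function clean_all(&$items,$leave = ''){
    foreach($items as $id => $item){
        // Skip the variable name(s) the caller asked to keep.
        if($leave && ((!is_array($leave) && $id == $leave) || (is_array($leave) && in_array($id,$leave)))) continue;
        // $GLOBALS contains an entry for itself; leave that one alone.
        if($id != 'GLOBALS'){
            if(is_object($item) && ((get_class($item) == 'simple_html_dom') || (get_class($item) == 'simple_html_dom_node'))){
                // A DOM or node object: clear its internal references, then drop the variable.
                $items[$id]->clear();
                unset($items[$id]);
            }else if(is_array($item)){
                // An array (for example a find() result): only the first element is inspected,
                // and the array is dropped without calling clear() on its nodes.
                $first = array_shift($item);
                if(is_object($first) && ((get_class($first) == 'simple_html_dom') || (get_class($first) == 'simple_html_dom_node'))){
                    unset($items[$id]);
                }
                unset($first);
            }
        }
    }
}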

Usage:

Clean ALL traces of PHP Simple HTML Dom Parser from memory: clean_all($GLOBALS);

Clean all traces of PHP Simple HTML Dom Parser from memory, except $myobj: clean_all($GLOBALS,'myobj');

Clean all traces of PHP Simple HTML Dom Parser from memory, except list of objects ($myobj1,$myobj2...): clean_all($GLOBALS,array('myobj1','myobj2'));

Hope it will help others too.


Generally I use it when I call str_get_html() twice, like this:

$site=file_get_contents('http://google.com');
$site_html=str_get_html($site);
foreach($site_html->find('a') as $a){
   $site2=file_get_contents($a->href);
   $site2_html=str_get_html($site2);
   echo $site2_html->find('p',0)->plaintext;
}
clean_all($GLOBALS);

In this example I can't call $site_html->clear() before the foreach, because the foreach would then fail. And because str_get_html() is called multiple times without clearing the previous objects, the references between them get broken, so clearing only at the very end still leaks memory. That's why my function has to search the defined variables for simple_html_dom objects and clear them manually.
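As a complement to the example above, here is a minimal sketch of clearing each inner DOM at the end of its own iteration, so that only one of them is alive at a time. The function name first_paragraphs() and the $parse parameter are hypothetical and only there so the sketch runs without the library; with the library loaded you would pass 'str_get_html'. It relies on find('p', 0) returning a single node or null, and on copying the plaintext string out before the DOM is cleared.

```php
<?php
// Sketch only: $parse is a hypothetical injection point, not part of the
// library. With simple_html_dom loaded, pass 'str_get_html' for it.
function first_paragraphs(array $pages, $parse)
{
    $texts = array();
    foreach ($pages as $markup) {
        $dom = call_user_func($parse, $markup);
        if (!$dom) {                           // guard against a failed parse
            $texts[] = null;
            continue;
        }
        $p = $dom->find('p', 0);               // single node, or null if no match
        $texts[] = $p ? $p->plaintext : null;  // copy the string out first
        unset($p);
        $dom->clear();                         // break the DOM's internal references
        unset($dom);                           // then drop our own handle
    }
    return $texts;
}
```

Usage would look like first_paragraphs($pages, 'str_get_html'), where $pages is an array of HTML strings already fetched. The same idea applies to the outer page: if the href values are copied into a plain array of strings first, the outer DOM can be cleared before the loop starts.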

In my case I forked inside the foreach, and after a few steps the main PHP script was using about 100 MB of memory. With every further fork the usage kept growing until it almost killed my server. Of course, when a PHP script ends it frees its memory, but with 8 GB in use it took ages to finish.
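The fatal error from the question comes from calling clear() on something that is not an object: find() without an index returns an array of nodes, and an attribute such as href is a plain string. A small helper along these lines can be pointed at any such value. The name clear_result() is hypothetical, and the sketch only assumes that DOM and node objects expose a clear() method, as clean_all() above already does.

```php
<?php
// Sketch: call clear() on whatever find() handed back, whether that is a
// single node, an array of nodes, or something with no clear() at all
// (null, or a string such as an href value).
function clear_result(&$value)
{
    if (is_array($value)) {
        // Walk the array and clear every element in place.
        foreach ($value as $key => $unused) {
            clear_result($value[$key]);
        }
    } elseif (is_object($value) && method_exists($value, 'clear')) {
        $value->clear();
    }
    $value = null;   // drop the caller's reference as well
}
```

Calling clear_result($y) after the last use of $y would then work the same way for $y = $main_html->find(...) with or without an index. The main DOM object still needs its own clear() call.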




Answer 2:


I believe you need to call clear() on $main_html.

From the docs...

Q: This script is leaking memory seriously... After it finished running, it's not cleaning up dom object properly from memory..

A: Due to PHP5 circular references memory leak, after creating DOM object, you must call $dom->clear() to free memory if call file_get_dom() more than once.

Example:

$html = file_get_html(...); 
// do something... 
$html->clear(); 
unset($html);
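To illustrate the circular-reference problem the FAQ refers to, here is a self-contained sketch that does not use the library at all. DemoNode is a hypothetical stand-in for a parsed DOM in which parent and children point at each other, and leftover_after_unset() is an illustrative helper. It counts how many values PHP's cycle collector (gc_collect_cycles(), available since PHP 5.3) has to reclaim after the tree is dropped, with and without breaking the links by hand first.

```php
<?php
// Hypothetical stand-in for a parsed DOM, not the library's own class:
// parent and children point at each other, which forms reference cycles.
class DemoNode
{
    public $parent = null;
    public $children = array();

    public function add(DemoNode $child)
    {
        $child->parent = $this;
        $this->children[] = $child;
    }

    // Same idea as calling clear() on the DOM: cut the links by hand.
    public function clear()
    {
        foreach ($this->children as $child) {
            $child->clear();
        }
        $this->children = array();
        $this->parent = null;
    }
}

// Builds a small tree, drops it, and returns what the cycle collector
// still had to reclaim afterwards.
function leftover_after_unset($callClear)
{
    gc_enable();
    gc_collect_cycles();              // start from a clean slate
    $root = new DemoNode();
    for ($i = 0; $i < 100; $i++) {
        $root->add(new DemoNode());
    }
    if ($callClear) {
        $root->clear();               // no cycles remain after this
    }
    unset($root);
    return gc_collect_cycles();
}
```

Without the clear() call, unset() alone cannot free the tree because every child still references the root, so the memory is held until the collector runs (or, on PHP versions without a cycle collector, until the script ends). With clear(), plain reference counting frees everything immediately and the collector finds nothing left over.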


Source: https://stackoverflow.com/questions/18090212/php-simple-html-dom-parser-memory-leak-usage
