DomDocument class unable to access DOMNode

长情又很酷 2020-12-20 08:57

I can't parse this URL: http://foldmunka.net

$ch = curl_init("http://foldmunka.net");

//curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RE         


3 Answers
  • 2020-12-20 09:19

    I'm not sure I understand what this script does. The replace operations look like an attempt at sanitization, but I don't see what for if you're only extracting some parts of the code. Have you tried the Simple HTML DOM Parser? It may handle the parsing part more easily. Check out the examples.

  • 2020-12-20 09:26

    Here is a Simple HTML DOM Parser solution, just for comparison. Its output is similar to that of the DomDocument solution, but it is more complicated and runs much slower (~2300 ms versus DomDocument's ~100 ms), so I don't recommend using it:

    Updated to work with <img> elements inside <a> elements.

    <?php
    require_once('simple_html_dom.php');
    // A global accumulator is needed because Simple HTML DOM Parser's
    // callback handler does not pass extra arguments to the callback
    $processed_plain_text = '';
    
    define('LOAD_FROM_URL', 'loadfromurl');
    define('LOAD_FROM_STRING', 'loadfromstring');
    
    function callback_cleanNestedAnchorContent($element)
    {
        if ($element->tag == 'a')
            $element->innertext = makePlainText($element->innertext, LOAD_FROM_STRING);
    }
    
    function callback_buildPlainText($element)
    {
        global $processed_plain_text;
    
        $excluded_tags = array('script', 'style');
    
        switch ($element->tag)
        {
            case 'text':
                // skip 'text' nodes that are direct children of 'a', because
                // anchor tags and their required attributes are handled
                // separately in the 'a' case below; also skip text inside
                // excluded tags such as script and style
                if (($element->parent->tag != 'a') && !in_array($element->parent->tag, $excluded_tags))
                    $processed_plain_text .= $element->innertext . ' ';
                break;
            case 'img':
                $processed_plain_text .= $element->alt . ' ';
                $processed_plain_text .= $element->title . ' ';
                break;
            case 'a':
                $processed_plain_text .= $element->alt . ' ';
                $processed_plain_text .= $element->title . ' ';
                $processed_plain_text .= $element->innertext . ' ';
                break;
        }
    }
    
    function makePlainText($source, $mode = LOAD_FROM_URL)
    {
        global $processed_plain_text;
    
        if ($mode == LOAD_FROM_URL)
            $html = file_get_html($source);
        elseif ($mode == LOAD_FROM_STRING)
            $html = str_get_html($source);
        else
            return 'Wrong mode defined in makePlainText: ' . $mode;
    
        $html->set_callback('callback_cleanNestedAnchorContent');
    
        // processing with the first callback to clean up the anchor tags
        $html = str_get_html($html->save());
        $html->set_callback('callback_buildPlainText');
    
        // processing with the second callback to build the full plain text with
        // the required attributes of the 'img' and 'a' tags, excluding
        // unnecessary ones like script and style tags
        $html->save();
    
        $return = $processed_plain_text;
    
        // cleaning the global variable
        $processed_plain_text = '';
    
        return $return;
    }
    
    //$html = '<html><title>Hello</title><body>Hello <span>this</span> site<img src="asdasd.jpg" alt="alt attr" title="title attr"><a href="open.php" alt="alt attr" title="title attr">click <span><strong>HERE</strong></span><img src="image.jpg" title="IMAGE TITLE INSIDE ANCHOR" alt="ALTINACNHOR"></a> Some text.</body></html>';
    
    echo makePlainText('http://foldmunka.net');
    //echo makePlainText($html, LOAD_FROM_STRING);
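
The timings quoted above can be checked roughly with a small wrapper. This is only a sketch: `timeCall` is a hypothetical helper, not part of either library, and it assumes PHP 7.3+ for `hrtime()`. The `strrev` call is a stand-in workload so the snippet runs on its own.

```php
<?php

// Hypothetical helper: runs $fn once and returns [result, elapsed milliseconds].
function timeCall(callable $fn, ...$args): array
{
    $start = hrtime(true);                       // monotonic clock, in nanoseconds
    $result = $fn(...$args);
    $elapsedMs = (hrtime(true) - $start) / 1e6;  // nanoseconds to milliseconds

    return [$result, $elapsedMs];
}

// Stand-in workload; swap in 'makePlainText' and a URL to time the real thing.
[$value, $ms] = timeCall('strrev', 'abc');
printf("result=%s, took %.3f ms\n", $value, $ms);
```

A single run against a live URL also includes network latency, so averaging several runs, or timing against a locally saved copy of the page, gives a fairer comparison between the two parsers.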
    
  • 2020-12-20 09:43

    Here is a solution with DomDocument and DOMXPath. It is much shorter and runs much faster (~100 ms versus ~2300 ms) than the other solution, which uses Simple HTML DOM Parser.

    <?php
    
    function makePlainText($source)
    {
        $dom = new DOMDocument();
        $dom->loadHTMLFile($source);
    
        // use this instead of loadHTMLFile() to load from a string:
        //$dom->loadHTML('<html><title>Hello</title><body>Hello this site<img src="asdasd.jpg" alt="alt attr" title="title attr"><a href="open.php" alt="alt attr" title="title attr">click</a> Some text.</body></html>');
    
        $xpath = new DOMXPath($dom);
    
        $plain = '';
    
        foreach ($xpath->query('//text()|//a|//img') as $node)
        {
            if ($node->nodeName == '#cdata-section')
                continue;
    
            if ($node instanceof DOMElement)
            {
                if ($node->hasAttribute('alt'))
                    $plain .= $node->getAttribute('alt') . ' ';
                if ($node->hasAttribute('title'))
                    $plain .= $node->getAttribute('title') . ' ';
            }
            if ($node instanceof DOMText)
                $plain .= $node->textContent . ' ';
        }
    
        return $plain;
    }
    
    echo makePlainText('http://foldmunka.net');
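
For pages with messy markup, a variant along the following lines may be easier to work with. It is a sketch rather than a drop-in replacement: `makePlainTextFromString` and the `$sample` markup are made up for illustration. It differs from the function above in three ways: it loads from a string, it collects libxml parse warnings internally instead of printing them, and it states the script/style exclusion directly in the XPath expression instead of relying on the node type check. It assumes a non-empty input string.

```php
<?php

// Hypothetical helper (not from the original answer): collects visible text
// plus alt/title attributes from a non-empty HTML string.
function makePlainTextFromString(string $html): string
{
    // Collect parse warnings internally instead of emitting them;
    // real-world pages are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Text nodes outside <script>/<style>, plus <a> and <img> elements.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]|//a|//img';

    $parts = [];
    foreach ($xpath->query($query) as $node) {
        if ($node instanceof DOMElement) {
            foreach (['alt', 'title'] as $attribute) {
                if ($node->hasAttribute($attribute)) {
                    $parts[] = $node->getAttribute($attribute);
                }
            }
        } elseif ($node instanceof DOMText) {
            $parts[] = $node->textContent;
        }
    }

    // Collapse runs of whitespace so the result is a single readable line.
    return trim(preg_replace('/\s+/', ' ', implode(' ', $parts)));
}

$sample = '<html><head><title>Hello</title><style>p { color: red; }</style></head>'
    . '<body>Hello site<img src="a.jpg" alt="alt attr" title="title attr">'
    . '<script>var hidden = 1;</script>'
    . '<a href="open.php" title="link title">click</a> Some text.</body></html>';

echo makePlainTextFromString($sample), "\n";
```

Collecting the pieces in an array and joining them once keeps the whitespace handling in one place. The text inside an `<a>` element is still picked up, because its text nodes match `//text()` on their own.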
    