Question

I'm having an issue while parsing HTML with PHP's DOMDocument.

The HTML I'm parsing has the following script tag:

<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="$onClick"><div class="sBtn">$text<img src="$iconImg" /></div><div class="sCountBox">$count</div></a></div>',
    }
</script>

This snippet has two problems:

1) The HTML inside the buttonWithCountTemplate property is not escaped. DOMDocument handles this correctly, escaping the characters when parsing it. Not a problem.

2) Near the end, there's an img tag with an unescaped closing tag:

<img src="$iconImg" />

The /> makes DOMDocument think that the script is finished even though the closing tag is missing. If you extract the script using getElementsByTagName, you'll get the element closed at this img tag, and the rest will appear as text in the HTML.

My goal is to remove all scripts in this page, so if I do a removeChild() over this tag, the tag is removed but the following part appears as text when rendering the page:

</div><div class="sCountBox">$count</div></a></div>',
        }
    </script>

Fixing the HTML is not a solution, because I'm developing a generic parser that needs to handle all types of HTML.

My question is if I should do any sanitization before feeding the HTML to DOMDocument, or if there's an option to enable on DOMDocument to avoid triggering this issue, or even if I can strip all tags before loading the HTML.

Any ideas?

EDIT

After some research, I found out the real problem of the DOMDocument parser. Consider the following HTML:

<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>

Using the following PHP code to remove script tags (based on Gholizadeh's answer):

<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('js.html'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
//@$dom->loadHTMLFile('script.html'); //fix tags if not exist

while($nodes = $dom->getElementsByTagName("script")) {
    if($nodes->length == 0) break;
    $script = $nodes->item(0);
    $script->parentNode->removeChild($script);
}

//return $dom->saveHTML();
$final = $dom->saveHTML();
echo $final;

The result will be the following:

<div> <!-- Offending div without closing tag -->
<p>';
       // I should not appear on the result
</p></div>

The problem is that the first div tag is not closed, and it seems that DOMDocument treats the div tag inside the JS string as HTML instead of as part of a plain JS string.

What can I do to solve this? Remember that modifying the HTML is not an option, since I'm developing a generic parser.

Answer 1:

I tested the following code on an HTML file like this:

<p>some text 1</p>
<img src="http://www.example.com/images/some_image_1.jpg">
<p>some text 2</p>
<p>some text 3</p>
<img src="http://www.example.com/images/some_image_2.jpg">

<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="$onClick"><div class="sBtn">$text<img src="$iconImg" /></div><div class="sCountBox">$count</div></a></div>',
    }
</script>

<p>some text 4</p>
<p>some text 5</p>
<img src="http://www.example.com/images/some_image_3.jpg">

The PHP code is:

<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

    $dom = new DOMDocument;
    $dom->preserveWhiteSpace = false;
    @$dom->loadHTML(file_get_contents('script.html'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    //@$dom->loadHTMLFile('script.html'); //fix tags if not exist 

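    // getElementsByTagName() returns a live list, so copy it to an array
    // before removing nodes; otherwise some script elements can be skipped.
    $nodes = iterator_to_array($dom->getElementsByTagName("script"));

    foreach ($nodes as $script) {
        $script->parentNode->removeChild($script);
    }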

    //return $dom->saveHTML();
    $dom->saveHTMLFile('script.html');

It works on the given example. I think you should use the same options I used when loading the HTML.
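One caveat when removing nodes in a loop: getElementsByTagName() returns a live list, so deleting elements while iterating over it may skip some of them. The following is only a minimal sketch of one way around that, using a made-up input string with three script elements; it copies the list to an array first.

```php
<?php
// Hypothetical input: three script elements followed by content to keep.
$html = '<div><script>a()</script><script>b()</script><script>c()</script><p>kept</p></div>';

libxml_use_internal_errors(true);   // buffer parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// Snapshot the live list so removals do not shift the nodes still to be visited.
$scripts = iterator_to_array($dom->getElementsByTagName('script'));
foreach ($scripts as $script) {
    $script->parentNode->removeChild($script);
}

$remaining = $dom->getElementsByTagName('script')->length;
$output    = trim($dom->saveHTML());
echo $output, PHP_EOL;
```

The while loop used in the question's EDIT (always removing item 0 until the list is empty) is an equally valid way to deal with the live list.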

Edited according to the latest update to the question:

Strictly speaking, you can't parse [X]HTML with regex (read this link for more information), but if your only purpose is to remove script tags, and you can make sure that no </script> tag appears as a string inside them, you can use this regex:

$html = mb_convert_encoding(file_get_contents('script2.html'), 'HTML-ENTITIES', 'UTF-8');
$new_html = preg_replace('/<script(.*?)>(.*?)<\/script>/si', '', $html);
file_put_contents('script-result.html', $new_html);

Frankly, the problem is that your HTML may not be standard. I think it's better to try the other libraries linked here.

Otherwise, I guess you would have to write a special parser that removes script tags and takes care of the single and double quotes inside them.
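As a rough illustration of what such a special parser could look like, here is a minimal sketch. The function name stripScriptBlocks is hypothetical, and the sketch makes several assumptions: it only tracks quoted strings (single, double, and backtick), it does not understand JavaScript comments or regex literals, and it deliberately ignores a </script> that sits inside a quoted string, which is not how browsers behave.

```php
<?php
/**
 * Hypothetical helper: remove <script>...</script> blocks, ignoring any
 * "</script" that appears inside a quoted JavaScript string.
 */
function stripScriptBlocks(string $html): string
{
    $out = '';
    $pos = 0;
    $len = strlen($html);

    while (preg_match('/<script\b[^>]*>/i', $html, $m, PREG_OFFSET_CAPTURE, $pos)) {
        $start = $m[0][1];
        $out  .= substr($html, $pos, $start - $pos);  // keep text before the tag
        $i     = $start + strlen($m[0][0]);           // first byte of the script body
        $quote = null;                                // delimiter of the open string, if any

        while ($i < $len) {
            $ch = $html[$i];
            if ($quote !== null) {
                if ($ch === '\\') {                   // skip the escaped character
                    $i += 2;
                    continue;
                }
                if ($ch === $quote) {                 // string literal ends
                    $quote = null;
                }
                $i++;
                continue;
            }
            if ($ch === "'" || $ch === '"' || $ch === '`') {
                $quote = $ch;                         // string literal starts
                $i++;
                continue;
            }
            if (substr_compare($html, '</script', $i, 8, true) === 0) {
                $end = strpos($html, '>', $i);        // consume the whole end tag
                $i   = ($end === false) ? $len : $end + 1;
                break;
            }
            $i++;
        }
        $pos = min($i, $len);  // an unterminated script swallows the rest of the input
    }

    return $out . substr($html, $pos);
}

// Usage with markup shaped like the example from the question's EDIT.
$input = "<div>\n<script type=\"text/javascript\">\n var test = '</div>';\n var s = '</script>';\n</script>\n<p>kept</p>";
echo stripScriptBlocks($input), PHP_EOL;
```

The cleaned string can then be passed to DOMDocument::loadHTML(). An apostrophe inside a JavaScript comment would confuse this scanner, so treat it as a starting point rather than a complete solution.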

Answer 2:

I am offering a different approach to your problem:

My goal is to remove all scripts in this page

Then you can remove them with the preg_replace_callback function and parse the HTML as DOM afterwards. Here is a working demo:

$htmlWithScript = "<html><body><div>something></div><script type=\"text/javascript\">
var showShareBarUI_params_e81 =
{
    buttonWithCountTemplate: '<div class=\"sBtnWrap\"><a href=\"#\" onclick=\"\$onClick\"><div class=\"sBtn\">\$text<img src=\"\$iconImg\" /></div><div class=\"sCountBox\">\$count</div></a></div>',
}
</script></body></html>";



$htmlWithoutScript = preg_replace_callback('~<script.*>.*</script>~Uis', function($matches){
return '';
}, $htmlWithScript);

EDIT

But how do I do this without summoning Cthulhu?

Nice comment, but I don't know what you are asking :) If it is about loading the HTML, you can load it with file_get_contents().

If you do not understand how it removes the tags: preg_replace_callback lets you search for matches against a regexp and transform them. In this case it removes them (return '';). The regexp looks for an opening <script tag with any attributes (.*) and any content up to the closing </script> tag.

Modifiers:

U -> means ungreedy (shortest match possible)

i -> case insensitive (<SCRIPT> will be matched as well)

s -> the . (dot) character also matches newlines (a newline will not break the match)

I hope this clarifies it a bit..
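To make the effect of the modifiers concrete, here is a small sketch with a made-up input string. It runs the same pattern twice, once with U and once without, so the difference between the ungreedy and greedy match is visible.

```php
<?php
// Made-up input: two script blocks, one upper-case and spanning two lines.
$html = "<p>a</p><SCRIPT>one();\n</SCRIPT><p>b</p><script>two();</script><p>c</p>";

// U: ungreedy, i: case-insensitive, s: dot also matches newlines.
$ungreedy = preg_replace('~<script.*>.*</script>~Uis', '', $html);
echo $ungreedy, PHP_EOL;   // prints "<p>a</p><p>b</p><p>c</p>"

// Without U the match runs from the first opening tag to the last closing
// tag, so the paragraph between the two scripts is removed as well.
$greedy = preg_replace('~<script.*>.*</script>~is', '', $html);
echo $greedy, PHP_EOL;     // prints "<p>a</p><p>c</p>"
```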

Answer 3:

Have you tried setting libxml to use internal errors?

$use_errors = libxml_use_internal_errors(true);
// your parsing code here
libxml_clear_errors();
libxml_use_internal_errors($use_errors);

It might allow DOMDocument to continue parsing (maybe).
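For reference, a minimal sketch of how the buffered errors can be inspected, using a made-up malformed snippet. As far as I know, libxml_use_internal_errors() only changes how problems are reported (collected instead of emitted as PHP warnings); it is an assumption on my part that it does not change how the tree itself is built, so it may not fix the script issue from the question.

```php
<?php
// Hypothetical malformed input, just to have something the parser may complain about.
$html = '<div><p>unclosed<span></div></bogus>';

$previous = libxml_use_internal_errors(true);  // buffer errors instead of emitting warnings

$dom    = new DOMDocument();
$loaded = $dom->loadHTML($html);

$errors = libxml_get_errors();                 // array of LibXMLError objects
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

libxml_clear_errors();                         // empty the buffer
libxml_use_internal_errors($previous);         // restore the previous setting
```

Which messages appear, if any, depends on the libxml2 version PHP was built against.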

Answer 4:

Parsing HTML documents is mostly about their content, not their scripts. Using those scripts without knowing their behaviour and origin can be especially dangerous and risky.

So when it comes to HTML content, you can omit scripts with this approach (which I've already pointed out in a comment): How to combine PHP's DOMDocument with a JavaScript template

To be specific with your example:

<?php
$html = <<<END
<!DOCTYPE html>
<html><body><h1>Hey now</h1>
<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="onClick"><div class="sBtn">text<img src="iconImg" /></div><div class="sCountBox">count</div></a></div>'
    }
</script>
</body></html>
END;

$dom = new DOMDocument();
$dom->preserveWhiteSpace = true; // needs to be before loading, to have any effect
// Note: loadXML() only works here because this sample is well-formed XML.
$dom->loadXML($html);
    while (($r = $dom->getElementsByTagName("script")) && $r->length) {
        $r->item(0)->parentNode->removeChild($r->item(0));
    }
$dom->formatOutput = false;
print $dom->saveHTML();

//Outputs
//<!DOCTYPE html><html><head></head><body><h1>Hey now</h1></body></html>

You can also try using regular expressions to remove script tags before loading the markup into DOMDocument, or check other HTML parsing libraries. Finally, you have to realize that in some cases even a perfect expression will break, and that the DOMDocument parser is not as good as a true browser engine. Everything comes down to the purpose of your parsing and finding the best solution for it.
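A minimal sketch of the "strip first, then load" idea, applied to markup shaped like the example from the question's EDIT. The pattern is my own assumption rather than something taken from the answers, and like any regex here it breaks if a literal </script> appears inside a JavaScript string.

```php
<?php
$html = <<<'END'
<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>
<p>kept</p>
END;

// Remove script elements from the raw string before DOMDocument sees them.
$withoutScripts = preg_replace('~<script\b[^>]*>.*?</script\s*>~is', '', $html);
if ($withoutScripts === null) {
    $withoutScripts = $html;   // preg_replace() failed; fall back to the original
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($withoutScripts, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$result = $dom->saveHTML();
echo $result;
```

Because the script body never reaches the parser, the </div> inside the JavaScript string can no longer be mistaken for markup.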

PHP Simple HTML DOM Parser Example:

http://simplehtmldom.sourceforge.net/manual.htm

require_once 'libs/simplehtmldom_1_5/simple_html_dom.php';
$html = <<<END
<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>
END;

$dom = str_get_html($html);
echo $dom;

//outputs with no error or warnings
//<div> <!-- Offending div without closing tag --><script type="text/javascript">var test = '</div>';// I should not appear on the result  </script>

Source: https://stackoverflow.com/questions/40703313/php-domdocument-errors-while-parsing-unescaped-strings

Tags

php