Question
Previous posts here and here both suggest prepending a declaration with the correct encoding, i.e. UTF-8.
Additionally, in reading similar articles here and here, the recommendation is to use `<?xml version="1.0" encoding="UTF-8"?>` instead.
It isn't immediately clear to me whether, if a page already includes `<meta charset="UTF-8">`, the `loadHTML` call can be limited to `$output = $dom->loadHTML($output, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);`. I am assuming yes. The page being parsed is HTML.
Equally, I have presumed that `$output = $dom->saveHTML();` will support both English and international languages, e.g. Cyrillic and Arabic, if the page has the correct encoding.
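For context, the workaround described in the linked posts can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the sample page, the variable names and the clean-up loop are assumptions, and how the prepended declaration is treated may vary between libxml2 versions.

```php
<?php
// Hypothetical sample page; the markup and variable names are illustrative only.
$page = '<!DOCTYPE html><html lang="ru"><head><meta charset="UTF-8"><title>Demo</title></head>'
      . '<body><p>Привет, мир</p></body></html>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of emitting them as PHP warnings.
libxml_use_internal_errors(true);

// Prepend an XML encoding declaration as a hint that the bytes are UTF-8.
$dom->loadHTML(
    '<?xml encoding="UTF-8">' . $page,
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
libxml_clear_errors();

// Remove the helper declaration again if it was kept as a processing instruction,
// so that it does not end up in the serialized output.
foreach (iterator_to_array($dom->childNodes) as $node) {
    if ($node->nodeType === XML_PI_NODE) {
        $dom->removeChild($node);
    }
}
$dom->encoding = 'UTF-8';

// Text read back from the DOM is UTF-8.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
$output = $dom->saveHTML();
```

Because the declaration is removed before `saveHTML()` is called, it should not reach the validator in this sketch.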
Issues
- If `<meta http-equiv="Content-Type" content="text/html; charset=utf-8">` is prepended, the W3C validator returns the messages "Consider adding a lang attribute to the html start tag to declare the language of this document." and "Start tag seen without seeing a doctype first. Expected <!DOCTYPE html>", because the element is placed before the HTML tag.
- If `<?xml encoding="utf-8" ?>` is prepended, the validator similarly complains: "Saw <?. Probable cause: Attempt to use an XML processing instruction in HTML. (XML processing instructions are not supported in HTML.)"
Questions
- Should `<?xml version="1.0" encoding="UTF-8"?>` or `<meta http-equiv="Content-Type" content="text/html; charset=utf-8">` be used?
- Why is either necessary if a page already has the correct encoding specified?
- Should `mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');` be used instead? If yes, is it safe to use or does it remain vulnerable to XSS or malformed HTML?
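To make the third option concrete, here is a hedged sketch of the entity-based approach. The `'HTML-ENTITIES'` target of `mb_convert_encoding` is deprecated as of PHP 8.2, so the sketch swaps in `mb_encode_numericentity`. The sample input and the conversion map are assumptions chosen for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical input standing in for $profile from the question.
$profile = '<p>Привет, мир</p>';

// Conversion map [start, end, offset, mask]: turn every code point from
// U+0080 to U+10FFFF into a numeric character reference, leaving ASCII as-is.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($profile, $convmap, 'UTF-8');

$dom = new DOMDocument();
// Collect parser warnings internally instead of emitting them as PHP warnings.
libxml_use_internal_errors(true);
$dom->loadHTML($encoded, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// The parser decodes the numeric references back into UTF-8 text.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```

The conversion only changes how non-ASCII characters are represented. It leaves tags and attributes untouched, so it neither sanitizes markup nor repairs malformed HTML.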
Source: https://stackoverflow.com/questions/64742235/parsing-html-php-domdocument-loadhtml-utf-8-encoding