Question
Previous posts (here and here) both suggest prepending markup that declares the correct encoding, i.e. UTF-8.
Additionally, similar articles (here and here) recommend using `<?xml version="1.0" encoding="UTF-8"?>` instead.
It isn't immediately clear to me whether, if a page already includes `<meta charset="UTF-8">`, the call can be limited to `$output = $dom->loadHTML($output, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);`. I am assuming yes. The page being parsed is HTML.
Equally, I have presumed that `$output = $dom->saveHTML();` will support both English and international languages, e.g. Cyrillic and Arabic, if the page has the correct encoding.
Issues
- If `<meta http-equiv="Content-Type" content="text/html; charset=utf-8">` is prepended, the W3C validator returns the messages "Consider adding a lang attribute to the html start tag to declare the language of this document." and "Start tag seen without seeing a doctype first. Expected <!DOCTYPE html>", as the element is added before the HTML tag.
- If `<?xml encoding="utf-8" ?>` is prepended, the validator similarly complains: "Saw <?. Probable cause: Attempt to use an XML processing instruction in HTML. (XML processing instructions are not supported in HTML.)"
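For reference, here is a minimal sketch of the workaround as I understand it: the XML declaration is prepended only as a parsing hint and then removed from the tree before saving, so it should never reach the validator. The helper name `loadUtf8Html` and the sample string are hypothetical, and the sketch assumes libxml parses the hint as a processing-instruction node, which may vary between libxml2 versions.

```php
<?php
/**
 * Hypothetical helper: parse a UTF-8 HTML string, using a temporary XML
 * encoding declaration as a hint, then drop that hint from the tree.
 */
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(
        '<?xml encoding="utf-8" ?>' . $html,
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove the temporary hint (assumed to be a processing-instruction node)
    // so it does not appear in saveHTML() output.
    foreach (iterator_to_array($dom->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $dom->removeChild($node);
        }
    }

    return $dom;
}

$dom = loadUtf8Html('<p>Привет, مرحبا</p>');
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
$output = $dom->saveHTML();
```

Comparing `$text` with the original string is one way to check whether the input was decoded as UTF-8; whether `$output` contains raw characters or numeric entities is something I have not pinned down.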
Questions
- Should `<?xml version="1.0" encoding="UTF-8"?>` or `<meta http-equiv="Content-Type" content="text/html; charset=utf-8">` be used?
- Why is either necessary if a page already has the correct encoding specified?
- Should `mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');` be used instead? If yes, is it safe to use, or does it remain vulnerable to XSS or malformed HTML?
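To make the last question concrete, here is a small sketch of the entity-encoding approach. It uses `mb_encode_numericentity` rather than the `'HTML-ENTITIES'` target, since passing `'HTML-ENTITIES'` to `mb_convert_encoding` is deprecated as of PHP 8.2. The helper name `toNumericEntities` and the sample value of `$profile` are hypothetical.

```php
<?php
// Hypothetical helper: rewrite every non-ASCII character as a decimal
// numeric entity, leaving ASCII (including markup characters) untouched.
function toNumericEntities(string $html): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($html, $map, 'UTF-8');
}

// Sample input containing both a non-ASCII character and a script tag.
$profile = '<p>é <script>alert(1)</script></p>';
$encoded = toNumericEntities($profile);
```

As far as I can tell, only the non-ASCII character is rewritten here and the `<script>` element passes through unchanged, so this step looks like an encoding aid rather than a form of sanitization.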
Source: https://stackoverflow.com/questions/64742235/parsing-html-php-domdocument-loadhtml-utf-8-encoding