Parsing HTML - PHP DOMDocument loadHTML UTF-8 encoding

Question

Previous posts (here and here) both suggest prepending a declaration of the correct encoding, i.e. UTF-8, to the markup before it is parsed.

Similar articles (here and here) recommend prepending <?xml version="1.0" encoding="UTF-8"?> instead.

It isn't clear to me whether, if a page already includes <meta charset="UTF-8">, the call can be limited to $dom->loadHTML($output, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); with nothing prepended. I am assuming yes. The page being parsed is HTML.

Equally, I have presumed that $output = $dom->saveHTML(); will handle both English and non-Latin scripts, e.g. Cyrillic and Arabic, provided the page declares the correct encoding.

Issues

If <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> is prepended, the W3C validator returns the messages "Consider adding a lang attribute to the html start tag to declare the language of this document." and "Start tag seen without seeing a doctype first. Expected <!DOCTYPE html>.", because the element ends up before the <html> tag.
If <?xml encoding="utf-8" ?> is prepended, the validator similarly complains: "Saw <?. Probable cause: Attempt to use an XML processing instruction in HTML. (XML processing instructions are not supported in HTML.)"
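For illustration, here is a minimal sketch of the prepend-then-remove pattern those posts describe: the XML declaration is used only as a hint while parsing and is taken out of the tree again before anything is serialized. The helper name loadUtf8Html and the sample page are assumptions made for this example, and the sample page also carries its own <meta charset="UTF-8">, as in the scenario above. Whether the declaration is needed at all may depend on the libxml2 version PHP is built against.

```php
<?php
// Hypothetical helper (not from the original post): parse UTF-8 HTML with the
// XML declaration prepended only as a parser hint.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();

    // Collect libxml warnings (e.g. about HTML5 elements) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove the hint again if it was kept as a processing-instruction node.
    // Copy the node list first so removal does not disturb the iteration.
    foreach (iterator_to_array($dom->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $dom->removeChild($node);
        }
    }
    $dom->encoding = 'UTF-8';

    return $dom;
}

// Assumed sample page with Cyrillic text.
$html = '<!DOCTYPE html><html lang="ru"><head><meta charset="UTF-8">'
      . '<title>Demo</title></head><body><p>Привет, мир</p></body></html>';

$dom  = loadUtf8Html($html);
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```

Because the declaration is removed from the DOM after loading, it should not show up in the markup that is later sent to the validator.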

Questions

Should <?xml version="1.0" encoding="UTF-8"?> or <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> be used?
Why is either necessary if the page already specifies the correct encoding?
Should mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'); be used instead? If so, is it safe to use, or does it leave the page vulnerable to XSS or malformed HTML?
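As a hedged sketch of the entity-based alternative raised in the last question: the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2, and mb_encode_numericentity can do a comparable conversion, so that is what is shown here. The helper name toNumericEntities and the sample fragment are assumptions for this example, and the mbstring extension must be loaded. This step only changes how characters are encoded; it does not sanitize markup.

```php
<?php
// Hypothetical helper (not from the original post): replace every non-ASCII
// character with a numeric entity so the parser only sees ASCII bytes.
function toNumericEntities(string $utf8Html): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($utf8Html, $map, 'UTF-8');
}

$dom = new DOMDocument();

// Collect libxml warnings instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML(
    toNumericEntities('<p>Привет</p>'),
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// DOM strings are returned as UTF-8, so the entities come back as characters.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```

Scripts and attributes in the input pass through this conversion unchanged, so any filtering against XSS would have to happen as a separate step.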

Source: https://stackoverflow.com/questions/64742235/parsing-html-php-domdocument-loadhtml-utf-8-encoding

Tags

php

parsing

encoding