Parsing HTML - PHP DOMDocument loadHTML UTF-8 encoding

£可爱£侵袭症+ submitted on 2021-01-29 14:55:06

Question


Previous posts here and here both suggest adding a declaration with the correct encoding, i.e. UTF-8, to the markup before it is parsed.

Additionally, in reading similar articles here and here, the recommendation is to use <?xml version="1.0" encoding="UTF-8"?> instead.

It isn't immediately clear to me whether, if a page already includes <meta charset="UTF-8">, the loadHTML call can be limited to $output = $dom->loadHTML($output, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); with no extra encoding declaration. I am assuming yes. The page being parsed is HTML.

Equally, I have presumed that $output = $dom->saveHTML(); will handle both English and international text (e.g. Cyrillic, Arabic) as long as the page has the correct encoding.
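
One way to test this assumption is to round-trip a small UTF-8 document and compare the text that comes back. The sketch below assumes the page declares its encoding with the http-equiv form of the meta element, which libxml2 (the parser behind DOMDocument) recognises; whether the shorter <meta charset="UTF-8"> form is honoured may depend on the libxml2 version PHP was built against. The sample markup and variable names are made up for illustration.

```php
<?php
// Hypothetical sample page; the http-equiv meta declares UTF-8.
$html = '<!DOCTYPE html><html lang="en"><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<title>Encoding check</title></head>'
      . '<body><p>Привет</p><p>مرحبا</p></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$loaded = $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// loadHTML() returns a bool, so it is kept separate from the markup string.
// DOM text is exposed as UTF-8, so a correct parse gives back the original strings.
$texts = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    $texts[] = $p->textContent;
}

// Serialise the document again for inspection or validation.
$output = $dom->saveHTML();

var_dump($loaded, $texts === ['Привет', 'مرحبا']);
```

If the comparison comes back false for a given page, the parser most likely fell back to its default single-byte encoding, which is the situation the workarounds discussed here are meant to address.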

Issues

  1. If <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> is added, the W3C validator returns the messages "Consider adding a lang attribute to the html start tag to declare the language of this document." and "Start tag seen without seeing a doctype first. Expected <!DOCTYPE html>.", because the element is inserted before the <html> tag.
  2. If <?xml encoding="utf-8" ?> is added, the validator similarly complains "Saw <?. Probable cause: Attempt to use an XML processing instruction in HTML. (XML processing instructions are not supported in HTML.)"
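
Both validator complaints come from the encoding hint surviving into the serialised output. A minimal sketch of one common workaround, assuming the hint is only needed while parsing: prepend the processing instruction, then remove the resulting node from the document before calling saveHTML(). The fragment and variable names are hypothetical, and how the parser treats the <?xml ...> prefix can vary between libxml2 versions, so treat this as something to verify locally rather than as guaranteed behaviour.

```php
<?php
// Hypothetical UTF-8 fragment that carries no encoding declaration of its own.
$fragment = '<div><p>Привет</p><p>مرحبا</p></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them

// The hint is prepended for parsing only; it is not meant to reach the output.
$dom->loadHTML(
    '<?xml encoding="UTF-8">' . $fragment,
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
libxml_clear_errors();

// Copy the child list first: removing nodes while iterating a live
// DOMNodeList can skip entries.
foreach (iterator_to_array($dom->childNodes) as $node) {
    if ($node->nodeType === XML_PI_NODE) {
        $dom->removeChild($node);
    }
}

// Ask the serialiser for UTF-8 output.
$dom->encoding = 'UTF-8';
$clean = $dom->saveHTML();

echo $clean;
```

The same idea applies to a prepended meta element: locate it after parsing and remove it, so that only the page's own declaration is left in the saved markup.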

Questions

  1. Should <?xml version="1.0" encoding="UTF-8"?> or <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> be used?
  2. Why is either necessary if a page already has the correct encoding specified?
  3. Should mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'); be used instead? If yes, is it safe to use or does it remain vulnerable to XSS or malformed HTML?
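
For the third question, it may help to look at what an entity conversion actually changes. The 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2, so the sketch below uses mb_encode_numericentity instead, which is a swapped-in technique and not something the question itself proposes. The input string and the conversion map are illustrative assumptions.

```php
<?php
// Hypothetical UTF-8 input containing markup and non-ASCII text.
$profile = '<p>Привет <b>мир</b></p>';

// Convert every code point from U+0080 upwards to a decimal character reference.
// ASCII, including < > & and quotes, passes through unchanged, so this is an
// encoding workaround only; it does not sanitise or repair the markup.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii = mb_encode_numericentity($profile, $map, 'UTF-8');

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($ascii, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// The parser decodes the character references back into UTF-8 text.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

var_dump($ascii, $text);
```

Because the converted string is pure ASCII, the parser's encoding guess no longer matters. Tags and attributes in the input are left exactly as they were, so any XSS or malformed-HTML concerns still need to be handled separately.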

Source: https://stackoverflow.com/questions/64742235/parsing-html-php-domdocument-loadhtml-utf-8-encoding
