Prefer charset declaration in HTML meta tag or HTTP header?

Posted by 梦想与她 on 2019-11-30 08:37:33

Question


I'm parsing a lot of sites. Everything works fine, and I also read the charset declarations to convert encodings. Now I have a problem with http://celleheute.de/sonntagsfuhrung-3/.

The HTML meta tag says that the content is encoded as ISO-8859-2, but the HTTP header says it's UTF-8. The content really is UTF-8 encoded, so when my parser treats it as ISO-8859-2, some characters break.
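The breakage can be reproduced in a few lines. This is only an illustrative sketch: the string `sample` is an assumed stand-in for text on the page, not content taken from it.

```python
# Illustrative only: `sample` is an assumed example string, not the real page text.
sample = "Sonntagsführung"

# The server actually sends UTF-8 bytes ("ü" becomes two bytes).
raw = sample.encode("utf-8")

# Decoding those bytes as ISO-8859-2 (as the meta tag claims) does not raise,
# because every byte is a valid ISO-8859-2 character, but the two-byte "ü"
# turns into two unrelated characters.
garbled = raw.decode("iso-8859-2")

# Decoding with the charset from the HTTP header restores the original text.
correct = raw.decode("utf-8")
```

Note that the wrong decoding fails silently here, so a parser cannot rely on a decode error to notice the mismatch.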

Now my question is: which declaration should I prefer? Should I ignore meta tags when I can find the declaration in the HTTP header, or vice versa? What do most web browsers do?


Answer 1:


To understand what modern browsers do, you should start reading at http://w3c.github.io/html/syntax.html#determining-the-character-encoding

Steps one and two are most relevant to the question. They say

  1. If the user has explicitly instructed the user agent to override the document's character encoding with a specific encoding, optionally return that encoding with the confidence certain and abort these steps.

  2. If the transport layer specifies an encoding, and it is supported, return that encoding with the confidence certain, and abort these steps.

which means that the real HTTP header takes precedence over everything except a user override.

Beyond that it can get complex. A byte order mark, for example, can take precedence over the meta tag.


UPDATE: Since this answer was written, the spec changed (around mid-2012) so that the byte order mark now takes precedence over the HTTP header.
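For a parser that wants to mimic this order (byte order mark, then HTTP header, then meta tag), a minimal sketch could look like the following. The function name `pick_encoding`, the regular expressions, and the choice to scan only the first 1024 bytes for a meta tag are assumptions made for illustration; the real browser algorithm has more steps and a proper tokenizing prescan.

```python
import codecs
import re
from typing import Optional

# Byte order marks are checked first (hypothetical helper table).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified patterns; a real parser should tokenize the markup instead.
_HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
_META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def _supported(name: str) -> Optional[str]:
    """Return Python's canonical codec name, or None if the label is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def pick_encoding(body: bytes, content_type: Optional[str]) -> Optional[str]:
    """Choose an encoding: BOM first, then HTTP header, then meta tag."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            encoding = _supported(match.group(1))
            if encoding:
                return encoding
    # Assumption: only look for a meta declaration near the start of the document.
    match = _META_CHARSET.search(body[:1024])
    if match:
        encoding = _supported(match.group(1).decode("ascii"))
        if encoding:
            return encoding
    return None
```

For the page in the question, calling `pick_encoding(body, "text/html; charset=UTF-8")` would return the header's UTF-8 and never reach the conflicting meta tag.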




Answer 2:


There's simply no answer to this. The author of the page has committed an error by giving conflicting information. Which one is correct may as well be decided by a coin toss.

In general, I'd prefer the HTTP header as the primary value. The meta tag is just meant as a fallback anyway. If you want to follow any logic at all, first try to decode the document using the charset specified in the HTTP header. If that clearly fails, because certain bytes are invalid in the given encoding, try again in the charset specified in the meta tag, if any. If that still fails, all bets are off.

If neither fails but the encodings conflict, either involve a human or try some statistical analysis on the decoded text, which may tell you which is more likely to be correct.
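The fallback logic described above can be sketched roughly as follows. The names `try_decode` and `decode_with_fallback` are hypothetical, and the sketch only reports a conflict; it does not attempt the statistical analysis.

```python
from typing import Optional, Tuple


def try_decode(body: bytes, charset: Optional[str]) -> Optional[str]:
    """Strictly decode `body`; return None if the charset is missing, unknown, or invalid for these bytes."""
    if not charset:
        return None
    try:
        return body.decode(charset, errors="strict")
    except (UnicodeDecodeError, LookupError):
        return None


def decode_with_fallback(
    body: bytes,
    header_charset: Optional[str],
    meta_charset: Optional[str],
) -> Tuple[Optional[str], Optional[str], bool]:
    """Return (text, charset_used, conflict).

    The HTTP header charset is tried first and the meta charset second.
    `conflict` is True when both decode cleanly but produce different text,
    which is the case that needs a human or further analysis.
    """
    from_header = try_decode(body, header_charset)
    from_meta = try_decode(body, meta_charset)
    if from_header is not None:
        conflict = from_meta is not None and from_meta != from_header
        return from_header, header_charset, conflict
    if from_meta is not None:
        return from_meta, meta_charset, False
    # Neither declaration worked: all bets are off.
    return None, None, False
```

One caveat with this approach: single-byte encodings such as ISO-8859-2 accept almost any byte sequence, so a clean decode is weak evidence, whereas a clean UTF-8 decode of non-ASCII text is a much stronger signal.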



Source: https://stackoverflow.com/questions/7102925/prefer-charset-declaration-in-html-meta-tag-or-http-header
