What is meant by BOM? [closed]

Submitted by 夙愿已清 on 2021-01-27 07:02:52

Question


What is meant by BOM? I tried reading this article but didn't really understand what it means.

I read that some text editors put a BOM at the beginning of a file. What is it meant for?


Answer 1:


BOM stands for Byte Order Mark. In short, the BOM is a marker at the beginning of a file that indicates whether the most significant byte or the least significant byte comes first.

It causes a lot of problems, especially with UTF-8. UTF-8 does not need a BOM, but there is a variant called UTF8Y (or UTF-8 with BOM) that puts three extra bytes (EF BB BF) at the beginning of a file.

Sending a UTF8Y file with a UTF-8 encoding type causes a few extra bytes to be sent at the beginning of the file, which can cause all sorts of hard-to-track-down problems, including the DOCTYPE not being parsed correctly in IE or JSON files failing to decode.

It has bitten me a few times with files from other people, when I didn't check the filetype carefully.

My recommendation: Be mindful it exists, never purposefully use it.
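
The JSON failure mentioned above can be reproduced with a minimal Python sketch. The payload here is a made-up example, and the behaviour shown is that of Python's standard `json` and `codecs` modules; other parsers may be more or less forgiving.

```python
import codecs
import json

# Hypothetical payload: a small JSON document saved as "UTF-8 with BOM".
payload = codecs.BOM_UTF8 + b'{"name": "example"}'

# Decoding as plain UTF-8 keeps the BOM as the leading character U+FEFF.
text_with_bom = payload.decode("utf-8")

try:
    json.loads(text_with_bom)
    failed = False
except json.JSONDecodeError:
    failed = True  # the parser rejects the leading U+FEFF

# The "utf-8-sig" codec drops a leading BOM if one is present.
clean_text = payload.decode("utf-8-sig")
parsed = json.loads(clean_text)
```

Decoding with `utf-8-sig` is harmless for files without a BOM, so it is a reasonable default when reading files from other people.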




Answer 2:


A byte order mark allows a program to determine how to read Unicode data. From your Wiki page:

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in.

For UTF-8, there is no ambiguity over how to read the bytes and hence a BOM is often omitted. For UTF-16 and UTF-32 it is necessary to know how to interpret the bytes and a BOM can serve this purpose.
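
To make the difference concrete, here is a small Python sketch that encodes the BOM code point (U+FEFF) in each encoding. The variable names are illustrative only. UTF-8 yields a single fixed byte sequence, while UTF-16 and UTF-32 each yield two, one per byte order.

```python
import codecs

BOM = "\ufeff"  # the byte order mark is the code point U+FEFF

encoded = {
    "utf-8": BOM.encode("utf-8"),          # only one possible byte sequence
    "utf-16-be": BOM.encode("utf-16-be"),  # most significant byte first
    "utf-16-le": BOM.encode("utf-16-le"),  # least significant byte first
    "utf-32-be": BOM.encode("utf-32-be"),
    "utf-32-le": BOM.encode("utf-32-le"),
}

for name, raw in encoded.items():
    # Print each byte as two hex digits, e.g. "ef bb bf".
    print(name.ljust(10), " ".join(f"{b:02x}" for b in raw))
```

A reader that sees `fe ff` at the start of a UTF-16 stream knows the data is big endian; `ff fe` means little endian.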

Note that Java has problems with reading UTF-8 BOMs and you must manually handle these characters if present (see Reading UTF-8 - BOM marker for some links to the related Sun bugs).




Answer 3:


I'm probably going to cover stuff you already know, but here goes...

To understand the purpose of a BOM, you need to understand (at least conceptually) what endian-ness is all about.

If you're dealing with a single byte (8 binary bits), its bits increase in significance from right to left (just like the digits of a normal decimal number, such as "19"). That's simple enough as long as the number fits in a single byte. Once you get to two bytes, you need to know which of the two bytes is more significant; the two conventions are big endian and little endian. Big endian means that the lowest memory address (or the left-most, to continue the analogy to writing) holds the most significant byte, which continues the convention of Western decimal numbers; little endian is the reverse. Historically, Intel has been little endian, and Motorola has been big endian. (I haven't looked lately, that may be different now.)

The BOM is simply a marker saying which way to interpret the byte order of the data.
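
As a rough illustration of the two byte orders, the following Python sketch writes the same two-byte number both ways. The value `0x1234` is an arbitrary example.

```python
value = 0x1234  # arbitrary example number that needs two bytes

big = value.to_bytes(2, "big")        # most significant byte (0x12) first
little = value.to_bytes(2, "little")  # least significant byte (0x34) first

# Reading the bytes with the wrong byte order yields a different number.
misread = int.from_bytes(big, "little")

print(big.hex(), little.hex(), hex(misread))
```

The misread value is exactly the kind of mistake a BOM is meant to prevent: without a marker, the reader cannot tell which of the two byte sequences the writer produced.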




Answer 4:


Today, this is simply meant to say, "This file is in UTF-8". Or, "This file is in UTF-16". While it is still the same BOM character in both cases, the way the BOM is encoded implies how all the rest will be encoded.

If you do not know what the first character is, you cannot deduce the document encoding from it reliably - you have to determine it from somewhere else, or more or less guess it.
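
A minimal sketch of this kind of detection in Python might look like the following. The function name `sniff_bom` and the table `_BOMS` are hypothetical; only the `codecs.BOM_*` constants come from the standard library. When no BOM is found the function returns `None`, and the caller has to determine the encoding some other way.

```python
import codecs
from typing import Optional, Tuple

# Longer marks are checked first, because the UTF-32 LE mark (ff fe 00 00)
# begins with the same two bytes as the UTF-16 LE mark (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for mark, encoding in _BOMS:
        if data.startswith(mark):
            return encoding, len(mark)
    return None, 0


# Usage: skip the BOM bytes, then decode the rest with the detected encoding.
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(data)
text = data[skip:].decode(encoding) if encoding else None
```

Note that this is a heuristic: UTF-16 LE data whose first character after the BOM is U+0000 would be indistinguishable from UTF-32 LE.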

Post-downvote appendix:

Historically, the BOM character had a different purpose: it was the zero width no-break space (that is, as invisible as a Unicode character can be, but still a character). Lots of widely used software libraries, such as .NET and Java, add the BOM automatically or implicitly to written files or even byte arrays, which often tricks people into thinking that they are not using a BOM when they are. This often backfires when a stack of such libraries writes multiple BOMs at the beginning of the same file, because then your file begins with an illegal or unwanted character, the zero width no-break space, and you do not even see it when you inspect the file!

No wonder the BOM technique is not popular with everyone.
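
The multiple-BOM situation can be simulated in Python as a rough sketch. The doubled BOM below is constructed by hand for illustration rather than produced by any particular library.

```python
import codecs

# Simulated output of two layers that each prepended a UTF-8 BOM.
double = codecs.BOM_UTF8 + codecs.BOM_UTF8 + b"hello"

# "utf-8-sig" skips only one leading BOM; the second one survives as U+FEFF.
text = double.decode("utf-8-sig")

# The leftover character is invisible when printed but still counts.
print(text, len(text))

# One possible cleanup: strip any remaining leading U+FEFF characters.
cleaned = text.lstrip("\ufeff")
```

Comparing `len(text)` with the visible length is a quick way to spot the stray character when the printed output looks normal.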



Source: https://stackoverflow.com/questions/12860120/what-is-meant-by-bom
