Why do DocBook generated XHTML5 Section titles have ASCII #c2 characters in them?

独自空忆成欢 提交于 2019-12-11 17:56:40

问题


I noticed my generated XHTML5 numbered section titles have a  between the number and the title string. I thought this was a generation error. But no, the gentext file of my DocBook distribution, common/en.xml, actually specifies this.

Line 338 of common/en.xml:

<l:template name="section" text="%n. %t"/>

The dot and space following the %n are, when viewed in a hex editor, ASCII character codes C2 and A0, which are the  and NBSP characters respectively. I can understand NBSP. But why the �

I understand I can change this in my customization layer. But the default seems odd.

I'm using docbook-xsl-ns-1.77.1.


回答1:


That is because the encoding is UTF-8, which is the normal Unicode encoding for text these days. In UTF-8, any character above 0x7F is represented by a sequence of 2, 3, or 4 bytes depending on how many significant code bits it contains.

The 0xC2 is one of the chars that starts a 2-byte sequence. In binary, it's 1100 0010. The two 1 bits denote a 2-char sequence, and the bottom five bits are the first five of the encoded character. The second one, 0xA0, is 1001 0000. The single leading 1 bit (followed by a 0 bit) denotes a continuation of the sequence, and the bottom 6 bits are the bottom bits of the encoded character.

Putting the bottom five bits from the first byte together with the bottom six bits from the second, we get 000 1001 0000, in hex U+A0, which is indeed the nonbreaking space.



来源:https://stackoverflow.com/questions/11770764/why-do-docbook-generated-xhtml5-section-titles-have-ascii-c2-characters-in-them

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!