unicode | 易学教程

Java programmers, never use utf8 in MySQL. Use utf8mb4 instead!

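Recently I ran into a bug. I was trying to save a UTF-8 string through Rails into a MariaDB database encoded as "utf8", and got a bizarre error: Incorrect string value: ‘\xF0\x9F\x98\x83 <…’ for column ‘summary’ at row 1 The client was using UTF-8, the server was using UTF-8, the database was using UTF-8, and even the string being saved, " <…", was valid UTF-8. The crux of the problem is that MySQL's "utf8" is not real UTF-8. "utf8" supports at most three bytes per character, while real UTF-8 needs up to four bytes per character. MySQL never fixed this bug. Instead, in 2010 they released a character set called "utf8mb4" that works around it. Of course, they did not publicize the new character set widely. MySQL's "utf8mb4" is the real "UTF-8". MySQL's "utf8" is a "proprietary encoding" that cannot encode many Unicode characters. Every MySQL and MariaDB user currently on "utf8" should switch to "utf8mb4" and never use "utf8" again. So what is an encoding? What is UTF-8? As we all know, computers store text as 0s and 1s. For example, the character "C" is stored as "01000011", and to display it the computer goes through two steps: it reads "01000011" and obtains the number 67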
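The byte counts described in this excerpt can be checked by encoding the characters directly. The following is a minimal Python sketch, not the Rails and MariaDB setup from the post. It assumes the elided emoji is U+1F603, which is what the bytes \xF0\x9F\x98\x83 in the error message decode to; the sample character 中 is an illustrative choice and does not come from the post.

```python
# Assumption: the emoji elided in the excerpt is U+1F603, the code point
# that the bytes \xF0\x9F\x98\x83 from the error message decode to.
smiley = "\U0001F603"
smiley_bytes = smiley.encode("utf-8")

# Real UTF-8 needs four bytes here, one more than MySQL's "utf8" allows.
print(smiley_bytes, len(smiley_bytes))

# A character inside the Basic Multilingual Plane fits in three bytes.
han_bytes = "\u4e2d".encode("utf-8")  # 中, an illustrative sample character
print(han_bytes, len(han_bytes))

# The character "C" is stored as the number 67, bit pattern 01000011.
c_code = ord("C")
c_bits = format(c_code, "08b")
print(c_code, c_bits)
```

Any string containing a character that encodes to four bytes, like the first one here, would trigger the "Incorrect string value" error on a column that is limited to three bytes per character.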

Emacs lisp: Translate characters to standard ASCII transcription

Question: I am trying to write a function that translates a string containing Unicode characters into some default ASCII transcription. Ideally I'd like e.g. Ångström to become Angstroem or, if that is not possible, Angstrom. Likewise α=χ should become a=x (c?) or similar. Does Emacs have such built-in capabilities? I know I can get the names and similar properties of characters ( get-char-code-property ) but I know of no built-in transcription table. The purpose is to translate titles of entries into meaningfully

Part 12: Character Encoding

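Part 12: Character Encoding # Background Computers work with high and low voltage levels, where a high level stands for 1 and a low level for 0, so all information in a computer exists as binary code, whether it is text, images, sound, video, or games. ASCII: American Standard Code for Information Interchange. Much like Morse code, ASCII exists so that people can make sense of binary code. It was the first such encoding scheme, devised in the United States, and it defines which character each binary pattern stands for. ASCII uses combinations of 7 bits. The 128 integers these combinations can represent serve as code numbers for uppercase and lowercase letters, the digits 0 to 9, control characters, communication-specific characters, the space, operators, punctuation marks, and so on, and each code number maps to exactly one character. ASCII is usually stored with one extra bit, which may carry no meaning but makes it convenient to store each character in one byte. On a computer running an English-language system in the United States, the code table may be ASCII or EBCDIC, and text displays without garbling only when it is written and opened with the same encoding (ASCII, in the usual case). GBK: an extension of the Chinese national standard (GB) encoding for Chinese characters. ASCII only meets the needs of English text on computers. To extend ASCII and to support other languages, countries defined character encodings for their own languages, for example China's GBK. In the GBK encoding, some characters are represented by two-byte combinations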
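The 7-bit range of ASCII and the two-byte form of GBK described above can be seen in a short Python sketch. The sample strings "Hi!" and "A中" are illustrative choices and do not come from the article; the `gbk` and `latin-1` codecs are the ones shipped with standard CPython.

```python
# ASCII: every character maps to a code number below 128 (7 bits).
text = "Hi!"
ascii_codes = [ord(ch) for ch in text]
print(ascii_codes)

# Stored with one extra bit, each ASCII character occupies one byte.
ascii_bytes = text.encode("ascii")
print(len(ascii_bytes))

# GBK: the ASCII letter stays one byte, the Chinese character takes two.
gbk_mixed = "A\u4e2d".encode("gbk")  # "A中"
print(gbk_mixed, len(gbk_mixed))

# Reading the two GBK bytes with a different encoding gives garbled text.
mojibake = "\u4e2d".encode("gbk").decode("latin-1")
print(mojibake)
```

The last step shows the garbling the article warns about: the bytes are unchanged, but the reader applied a different code table than the writer.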

C++ how to read from unicode files by ignoring first character of each line

Question: Consider a file containing Unicode words as follows: آب آباد آبادان Reading right to left, the first character is " آ ". My first requirement is to read the file line by line. This would be simple. The second requirement is to read the file line by line starting from the second character of each line. The result must be something like this: ب باد بادان As you know, there are some solutions like std::string::substr to meet the second requirement, but AFAIK std::string::substr does not work well with Unicode

How to Print Box Characters in C (Windows)

Question: How might one go about printing an em dash in C? One of these: — Whenever I do: printf("—") I just get a ù in the terminal. Thank you. EDIT: The following code is supposed to print out an Xs and Os style grid with em dashes for the horizontal lines. int main () { char grid[3][3] = {{'a', 'a', 'a'}, {'a', 'a', 'a'}, {'a', 'a', 'a'}}; int i, j; for (i = 0; i < 3; i++) { for (j = 0; j < 3; j++) { if (j != 0) { printf("|"); } printf(" %c ", grid[i][j]); } if (i != 2) { printf("\n——————————————\n

The difference between utf8 and utf8mb4 in MySQL

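Reposted from: http://ourmysql.com/archives/1402 MySQL added the utf8mb4 encoding in version 5.5.3. "mb4" means "most bytes 4", and the encoding exists specifically to support four-byte Unicode characters. Fortunately, utf8mb4 is a superset of utf8, so apart from changing the encoding to utf8mb4 no other conversion is needed. Of course, to save space, utf8 is enough in most cases. 2. Description As noted above, utf8 can already store most Chinese characters, so why use utf8mb4? The utf8 encoding supported by MySQL allows at most 3 bytes per character, so inserting a 4-byte character raises an error. The largest Unicode code point that three-byte UTF-8 can encode is 0xffff, which is the end of the Basic Multilingual Plane (BMP). In other words, no Unicode character outside the BMP can be stored with MySQL's utf8 character set. This includes emoji (emoji are special Unicode characters, common on iOS and Android phones), many rarely used Chinese characters, any newly added Unicode characters, and so on. 3. Root cause The original UTF-8 format used one to six bytes and could encode up to 31 bits. The latest UTF-8 specification uses only one to four bytes and encodes up to 21 bits, just enough to represent all 17 Unicode planes. utf8 is MySQL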
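The three-byte boundary at 0xffff and the 21-bit limit mentioned above can be worked through in a few lines of Python. This is a sketch only: the helper names `utf8_length` and `fits_in_three_bytes` are hypothetical and are not part of MySQL or of any library, and surrogate code points (U+D800 to U+DFFF) are not treated specially.

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed by modern (one to four byte) UTF-8 for a code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("beyond the 17 Unicode planes")


def fits_in_three_bytes(text: str) -> bool:
    """True if every character lies in the BMP (at most U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# The last BMP code point takes three bytes, the next one takes four.
print(utf8_length(0xFFFF), utf8_length(0x10000))

# 17 planes of 65,536 code points each end at 0x10FFFF, a 21-bit number.
print(17 * 0x10000 == 0x10FFFF + 1, (0x10FFFF).bit_length())

# Common Chinese text fits in three bytes per character; an emoji does not.
print(fits_in_three_bytes("\u4e2d\u6587"), fits_in_three_bytes("\U0001F603"))
```

A check like `fits_in_three_bytes` could be used to find strings that a three-byte utf8 column would reject, although switching the column to utf8mb4 removes the need for it.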

Extract toUnicode map from One PDF and use in another

Question: I have a Unicode PDF document that is missing its ToUnicode map. I have a different PDF with the same font that does have the ToUnicode map. Can I extract it from one PDF and use it to extract text from the other PDF? Answer 1: The generic answer is no. The ToUnicode map you are talking about follows the PDF CMap format and is used to translate character codes into Unicode values. You face two potential pitfalls: 1) The fonts are not exactly the same. While their name may be the same, they might have a

Large emoji is cut off on chrome and mobile browsers

Question: I am trying to display some large emojis with Unicode in HTML or CSS. However, in Chrome, the lower part of some emojis is cut off, while others are displayed just fine. They also fail to render in mobile Firefox on Android. Example of working emoji: ☀ Example of non-working emoji: 🌝 <!doctype html> <html lang="en"> <head> <meta charset="utf-8"> <style> body { margin: 0; background-color: #fff; } p { font-size: 20em; margin: 0; } </style> </head> <body> <p>🌝</p> <p>☀</p> </body> </html>

How to display a colored emoji

Question: My code: let myemoji = "\u{2049}" let another = "\u{2757}" Playground result: The Unicode U+2049 does not produce a red colored emoji like this ⁉️. Is there anything specific to be added for this color? Answer 1: Some characters can be displayed "as text" or "as emoji", and a "Unicode VARIATION SELECTOR" can be used to control the presentation. Example: print("text presentation: \u{2049} \u{25B6}") print("emoji presentation: \u{2049}\u{FE0F} \u{25B6}\u{FE0F}") Result: For more information, see 1.4
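The effect of the variation selector can also be inspected at the code point level. The sketch below uses Python rather than the Swift of the question, and the variable names are illustrative. Whether the second string actually renders in color depends on the font and terminal, which this code cannot show.

```python
import unicodedata

# The base character from the question, with and without U+FE0F appended.
text_style = "\u2049"
emoji_style = "\u2049\ufe0f"

# The selector is a separate code point, so the string grows by one.
print(len(text_style), len(emoji_style))

# Official Unicode names of the two code points involved.
base_name = unicodedata.name("\u2049")
selector_name = unicodedata.name("\ufe0f")
print(base_name)
print(selector_name)
```

The selector has no glyph of its own. It only asks the renderer to prefer the emoji presentation of the character before it.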
