unicode | 易学教程

Encoding issues in Python

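Foreword: this article covers very basic material. The thing about basics is that one read-through feels like enough, yet much of the substance gets overlooked. I suggest you compare several sources and take your time appreciating the beauty of encoding. Also, I wrote this late at night, so it may contain errors; please read it critically. Thank you.

0x00: Why are there so many encodings?
Anyone who has studied computer science knows that all data (text, audio, video, and so on) is represented inside a computer in binary. The choice of binary is dictated by the hardware: computers are built from binary circuits with two stable states. This raises a problem: humans are not suited to reading binary directly. We therefore need a way to turn binary into something we can read, and that is how encodings came about.

0x01: History of encodings
Stage one: inside a computer, all data can only be 0 or 1 (a high voltage level represents 1 and a low level represents 0), so the characters we see must also be represented with 0s and 1s. Scientists (American scientists, in this case) came up with a scheme: map a specific number to a specific letter for storage and transmission. For example, to store the letter a, I store the number 97 (that is, the binary value 01100001 is stored in the computer). This process is called encoding (encode). When reading the data back, on encountering 97 we have the computer display the letter a. This process is called decoding (decode). The point to take away here: what the computer understands, we cannot read, and what we can read, the computer does not understand. Turning what the computer understands (binary 01100001) into what we understand (the number 97, that is, a)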

Using unicode characters as shape

Question: I'd like to use unicode characters as the shape of plots in ggplot, but for some unknown reason they're not rendering. I did find a similar query here, but I can't make the example there work either. Any clues as to why? Note that I don't want to use the unicode character as a "palette"; I want each item plotted by geom_point() to be the same shape (color will indicate the relevant variable). Running Sys.setenv(LANG = "en_US.UTF-8") and restarting R does not help. Wrapping the unicode in sprintf()

Julia: common operations on the String type

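1. Some characteristics of the String type

Julia strings have several noteworthy high-level features:

(1) The built-in type used for strings (and string literals) in Julia is String. It uses the UTF-8 encoding and supports the full range of Unicode characters. (A transcode() function is provided for converting to and from other Unicode encodings.)
(2) All string types are subtypes of the abstract type AbstractString, and external packages define additional AbstractString subtypes (for example, for other encodings). If you define a function that expects a string argument, declare the argument type as AbstractString so that it accepts any string type.
(3) Like C and Java, but unlike most dynamic languages, Julia has a first-class type for representing a single character, called Char. It is simply a special kind of 32-bit primitive type whose numeric value represents a Unicode code point.
(4) As in Java, strings in Julia are immutable: the value of an AbstractString object cannot be changed. To get a different string value, you construct a new string from parts of other strings.
(5) Conceptually, a string resembles an array of characters, so individual elements can be extracted by index; for some index values, however, no character is returned and an exception is thrown instead. This allows strings to be indexed efficiently by the byte index of the encoded representation rather than by character index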
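
Point (5) can be illustrated outside Julia. The following Python sketch only emulates the idea of indexing a UTF-8 string by byte offset and raising an error when the offset falls inside a multi-byte character. `char_at_byte` is a hypothetical helper, not a Julia or Python API, and it uses 0-based offsets whereas Julia indices start at 1.

```python
def char_at_byte(data: bytes, i: int) -> str:
    """Return the character whose UTF-8 encoding starts at byte offset i.

    Raises IndexError if i is out of range or points at a continuation byte.
    Assumes data is valid UTF-8.
    """
    if not 0 <= i < len(data):
        raise IndexError(f"byte offset {i} out of range")
    first = data[i]
    # Continuation bytes have the bit pattern 10xxxxxx
    if first & 0b1100_0000 == 0b1000_0000:
        raise IndexError(f"byte offset {i} is inside a multi-byte character")
    # The leading byte determines how long the sequence is
    if first < 0x80:
        length = 1
    elif first >> 5 == 0b110:
        length = 2
    elif first >> 4 == 0b1110:
        length = 3
    else:
        length = 4
    return data[i:i + length].decode("utf-8")


data = "aαb".encode("utf-8")   # 4 bytes: a (1 byte), α (2 bytes), b (1 byte)
print(char_at_byte(data, 0))   # a
print(char_at_byte(data, 1))   # α
print(char_at_byte(data, 3))   # b
# char_at_byte(data, 2) raises IndexError: offset 2 is the second byte of α
```

Lookup by byte offset takes constant time because no scan from the start of the string is needed; the price is that some offsets are simply not valid character positions.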

Strip special characters and punctuation from a unicode string

Question: I'm trying to remove the punctuation from a unicode string, which may contain non-ascii letters. I tried using the regex module: import regex text = u"<Üäik>" regex.sub(ur"\p{P}+", "", text) However, I've noticed that the characters < and > don't get removed. Does anyone know why and is there any other way to strip punctuation from unicode strings? EDIT: Another approach I've tried out is doing: import string text = text.encode("utf8").translate(None, string.punctuation).decode("utf8") but I
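
One likely explanation, offered as a sketch rather than as the accepted answer: Unicode assigns `<` and `>` the general category `Sm` (math symbol), not one of the `P*` punctuation categories, so `\p{P}` does not match them. The example below uses only the standard-library `unicodedata` module under Python 3. `strip_punct_and_symbols` is a hypothetical helper name, and dropping the `S*` categories along with `P*` is an assumption about what the asker wants.

```python
import unicodedata

def strip_punct_and_symbols(text: str) -> str:
    # Drop characters whose general category starts with P (punctuation) or S (symbol)
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ("P", "S")
    )

print(unicodedata.category("<"))            # Sm
print(unicodedata.category("!"))            # Po
print(strip_punct_and_symbols("<Üäik>!"))   # Üäik
```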

Detect single string fraction (ex: ½ ) and change it to longer string?

Question: For example, turn "32 ½ is not very hot " into x = "info: 32, numerator = 1, denominator = 2". Note: the fraction could be 3/9, but it must not be simplified into 1/3; in other words, take literally what is in the string. I need to detect the fractional string in a longer string and expand the information to a more usable form. ½ has been given to me decoded and is a string with length one.

Answer 1: There seem to be 19 such forms (here) and they all start with the name VULGAR FRACTION. import unicodedata def fraction_finder(s): for c in
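
The answer's code is cut off in this excerpt, so the following is an independent sketch of the same idea, not a reconstruction of it. It relies on the fact that the compatibility decomposition (NFKD) of a vulgar fraction character is its numerator, U+2044 FRACTION SLASH, and its denominator. `find_vulgar_fractions` is a hypothetical name, and treating everything before the fraction as the whole-number part is an assumption made only to reproduce the sample output.

```python
import unicodedata

def find_vulgar_fractions(s):
    """Yield (index, character, numerator, denominator) for each vulgar fraction in s."""
    for i, c in enumerate(s):
        if not unicodedata.name(c, "").startswith("VULGAR FRACTION"):
            continue
        # NFKD turns e.g. "½" into "1" + U+2044 FRACTION SLASH + "2"
        num, _, den = unicodedata.normalize("NFKD", c).partition("\u2044")
        yield i, c, int(num), int(den)

text = "32 ½ is not very hot"
for i, c, num, den in find_vulgar_fractions(text):
    whole = text[:i].strip()
    print(f"info: {whole}, numerator = {num}, denominator = {den}")
    # info: 32, numerator = 1, denominator = 2
```

Because the numerator and denominator are read from the decomposition and never reduced, a fraction is reported literally as it appears.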

Converting Mac Roman character to equivalent UTF-8

Question: I have been given some HTML files that use the Mac OS Roman file encoding. The files have French text, but in an editor many of the diacritical chars look strange (i.e. non French): Si cette option est sÈlectionnÈe, <removed> tentera de communiquer avec votre tÈlescope seulement ‡ líaide díun ... The capital E with accent does display properly in the browser as é, as do the other strange characters. I also have some UTF-8 French files that look normal in an editor (é looks like é). What I'd
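
A minimal Python sketch of such a conversion, assuming the files really are Mac OS Roman throughout: Python ships a `mac_roman` codec, so the bytes can be decoded with it and re-encoded as UTF-8. `macroman_to_utf8` and the sample bytes are illustrative and not taken from the question.

```python
from pathlib import Path

def macroman_to_utf8(src: Path, dst: Path) -> None:
    # Decode the raw bytes as Mac OS Roman, then write the text back out as UTF-8
    text = src.read_bytes().decode("mac_roman")
    dst.write_bytes(text.encode("utf-8"))

# In Mac OS Roman the single byte 0x8E is "é"; in UTF-8 it takes two bytes
sample = b"s\x8electionn\x8ee"
as_text = sample.decode("mac_roman")
print(as_text)                   # sélectionnée
print(as_text.encode("utf-8"))
```

If an HTML file declares its charset in a meta tag, that declaration would presumably need updating as well; the sketch does not touch it.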

Regex to Match Horizontal White Spaces

Question: I need a regex in Python 2 to match only horizontal white spaces, not newlines. \s matches all whitespaces including newlines. >>> re.sub(r"\s", "", "line 1.\nline 2\n") 'line1.line2' \h does not work at all. >>> re.sub(r"\h", "", "line 1.\nline 2\n") 'line 1.\nline 2\n' [\t ] works but I am not sure if I am missing other possible white space characters, especially in Unicode, such as \u00A0 (non breaking space) or \u200A (hair space). There are many more white space characters at the following
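
One common workaround, sketched here for Python 3 (where `str` patterns are Unicode-aware by default; Python 2 would need a unicode pattern and the `re.UNICODE` flag): use a negated class that matches anything which is whitespace but not a line-breaking character. Which characters count as "vertical" is an assumption of this sketch, and `HSPACE` is an illustrative name.

```python
import re

# "Whitespace, but not vertical": negate \S and exclude the line-breaking characters
HSPACE = re.compile(r"[^\S\n\v\f\r\x85\u2028\u2029]")

ascii_result = HSPACE.sub("", "line 1.\nline 2\n")
unicode_result = HSPACE.sub("", "a\u00a0b\u200ac\td\n")

print(repr(ascii_result))     # 'line1.\nline2\n'
print(repr(unicode_result))   # 'abcd\n'
```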

Remove invalid UTF-8 characters from a string

Question: I get this on json.Marshal of a list of strings: json: invalid UTF-8 in string: "...ole\xc5\" The reason is obvious, but how can I delete/replace such strings in Go? I've been reading docs on the unicode and unicode/utf8 packages and there seems to be no obvious/quick way to do it. In Python, for example, you have methods for it where the invalid characters can be deleted, replaced by a specified character, or, with the strict setting, raise an exception on invalid chars. How can I do the equivalent thing in Go?
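
For reference, the Python behaviour the question alludes to looks roughly like this. It is a sketch of the three error handlers of `bytes.decode`, not a Go solution; the sample bytes are modelled on the error message and the variable names are illustrative.

```python
raw = b"ole\xc5"   # ends with a lone lead byte, so it is not valid UTF-8

dropped = raw.decode("utf-8", errors="ignore")     # invalid bytes are deleted
replaced = raw.decode("utf-8", errors="replace")   # invalid bytes become U+FFFD

print(repr(dropped))    # 'ole'
print(repr(replaced))   # 'ole\ufffd'

try:
    raw.decode("utf-8")   # errors="strict" is the default
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)
```

On the Go side, `strings.ToValidUTF8` (available since Go 1.13) replaces runs of invalid bytes with a chosen replacement string, which appears to cover the delete and replace cases.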