unicode | 易学教程

UnicodeDecodeError when logging an Exception in Python

Question: I'm using Python 2.7.9 x32 on Win7 x64. When I log an exception containing umlauts, I always receive:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 39: ordinal not in range(128)

My example code is:

    except Exception as e:
        logging.error('Error loading SCMTool for repository '
                      '%s (ID %d): %s' % (repo.name, repo.id, e), exc_info=1)

The exception being logged is WindowsError: [Error 267] Der Verzeichnisname ist ungültig. The problem is based on the "ü" in "ungültig"
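The position reported in the error can be reproduced in isolation. The sketch below is written for Python 3 and assumes the Windows message arrived as cp1252-encoded bytes (an assumption; the question does not state the code page). Under that assumption, byte 0xfc at index 39 is the "ü", and decoding with the matching codec succeeds where ASCII fails.

```python
# Assumption: the OS error text is cp1252-encoded, as on a German Windows system.
message = "[Error 267] Der Verzeichnisname ist ung\u00fcltig"
raw = message.encode("cp1252")

error_text = None
bad_position = None
try:
    # The kind of implicit ASCII decode Python 2 performs when byte strings meet unicode.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error_text = str(exc)
    bad_position = exc.start     # index of the first undecodable byte

decoded = raw.decode("cp1252")   # decoding explicitly with the assumed code page works

print(error_text)
print(bad_position, hex(raw[bad_position]))
```

If the assumption holds, decoding the exception text explicitly before formatting the log message avoids the implicit ASCII step.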

Why doesn't Perl v5.22 find all the sentence boundaries?

Question: This is fixed in Perl 5.22.1. I write about it in Perl v5.22 adds fancy Unicode word boundaries.

Perl v5.22 added the Unicode assertions from TR #29. I've been playing with the sentence boundary assertion, but it only seems to find the start and end of text:

    use v5.22;

    $_ = "See Spot. (Spot is a dog.) See Spot run. Run Spot, run!\x{2029}New paragraph.";

    while( m/\b{sb}/g ) {
        say "Sentence boundary at ", pos;
    }

The output notes sentence boundaries at the start and end of text, but not after

Oracle PLSQL equivalent of ASCIISTR(N'str')

Question: My database has NLS_LANGUAGE: AMERICAN / NLS_CHARACTERSET: WE8ISO8859P15 / NLS_NCHAR_CHARACTERSET: AL16UTF16. NLS_LANG is set to AMERICAN_AMERICA.WE8MSWIN1252 under Windows > Properties > Advanced system settings > Advanced > Environment Variables; I hope it applies to my PLSQL Dev.

I use ASCIISTR to get a Unicode-escaped value for exotic characters like this:

    SELECT ASCIISTR(N'κόσμε') FROM DUAL;

This results in:

    ASCIISTR(UNISTR('\03BA\1F79\03...
    ---------------------------------
    \03BA\1F79\03C3\03BC\03B5

It
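For comparison outside the database, the escaped form shown in the query output can be approximated in a few lines of Python. This is a rough sketch, not Oracle's implementation: `asciistr_like` is a hypothetical helper that writes every non-ASCII UTF-16 code unit as a backslash followed by four hex digits and leaves ASCII characters untouched.

```python
def asciistr_like(text: str) -> str:
    """Hypothetical helper: escape non-ASCII UTF-16 code units as \\XXXX."""
    data = text.encode("utf-16-be")           # two bytes per UTF-16 code unit
    parts = []
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        parts.append(chr(unit) if unit < 0x80 else "\\%04X" % unit)
    return "".join(parts)

# The string from the question, spelled with escapes so the code points are unambiguous.
sample = "\u03ba\u1f79\u03c3\u03bc\u03b5"
print(asciistr_like(sample))
```

Working on UTF-16 code units rather than code points mirrors the AL16UTF16 national character set named in the question; characters outside the Basic Multilingual Plane would therefore appear as two escaped units.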

Optimal function to create a random UTF-8 string in PHP? (letter characters only)

Question: I wrote this function that creates a random string of UTF-8 characters. It works well, but the regular expression [^\p{L}] does not seem to filter out all non-letter characters. I can't think of a better way to generate the full range of Unicode without non-letter characters, short of manually searching for and defining the decimal letter ranges between 65 and 65533.

    function rand_str($max_length, $min_length = 1, $utf8 = true) {
        static $utf8_chars = array();
        if ($utf8 && !$utf8_chars) {
            for (
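As a point of comparison (in Python rather than PHP, and not the asker's function), one way to restrict the pool to letters is to test each code point's Unicode general category instead of filtering with a regular expression. A minimal sketch, assuming the same 65 to 65533 range mentioned in the question; `letter_pool` and `rand_letters` are hypothetical names.

```python
import functools
import random
import unicodedata

@functools.lru_cache(maxsize=None)
def letter_pool(first: int = 65, last: int = 65533) -> tuple:
    """Collect every code point in [first, last] whose general category is a letter (L*)."""
    return tuple(chr(cp) for cp in range(first, last + 1)
                 if unicodedata.category(chr(cp)).startswith("L"))

def rand_letters(max_length: int, min_length: int = 1, rng=None) -> str:
    """Return a random string of letters, min_length to max_length characters long."""
    rng = rng or random.Random()
    pool = letter_pool()                      # built once, then served from the cache
    length = rng.randint(min_length, max_length)
    return "".join(rng.choice(pool) for _ in range(length))

# ascii() keeps the output printable on consoles that cannot display every script.
print(ascii(rand_letters(10, rng=random.Random(0))))
```

Surrogate code points fall in category Cs, so the category test drops them without a special case.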

AJAX request returning unicode characters as question marks

Question: I have the following PHP script being called by AJAX:

    <?php
    // file /ajax/loopback.php
    $fp = fopen("php://input","r");
    $pdt = "";
    while(!feof($fp)) $pdt .= fgets($fp);
    fclose($fp);
    $_POST = json_decode($pdt,true);
    if( !$_POST) $_POST = Array();
    var_dump($_POST);
    exit;
    ?>

I then call this script with the following JavaScript:

    AJAX = function(url,data,callback) {
        var a = new XMLHttpRequest();
        if( data) {
            data = JSON.stringify(data);
        }
        a.open("POST","/ajax/"+url,true);
        a.onreadystatechange =

JavaScript core language notes - 2: Lexical structure

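Character set
JavaScript programs are written in the Unicode character set. Unicode is a superset of ASCII and Latin-1 and supports almost every language in use today. ECMAScript 3 requires JavaScript implementations to support Unicode 2.1 or later, and ECMAScript 5 requires Unicode 3 or later.

Case sensitivity
JavaScript is case-sensitive. Keywords, variables, function names, and all other identifiers must be written with consistent capitalization. Note that HTML and HTML 5 (tag and attribute names) are not case-sensitive, while XHTML is; modern browsers are usually tolerant, though, and parse pages correctly even when tag and attribute names mix cases. Pay particular attention to the fact that HTML attribute values are case-sensitive, for example <div class="warp Warp"></div>.

Whitespace, line breaks, and format control characters
JavaScript ignores spaces between the tokens of a program. In most cases it also ignores line breaks.

JavaScript recognizes the following whitespace characters: space (\u0020), horizontal tab (\u0009), vertical tab (\u000b), form feed (\u000c), no-break space (\u00a0), byte order mark (\ufeff).

JavaScript recognizes the following characters as line terminators: line feed (\u000a), carriage return (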

Character encoding and transcoding

Background
In Python 2 the default encoding is ASCII; in Python 3 it is UTF-8, and str values are Unicode text. In Python 3, encode transcodes and also turns a str into bytes; decode decodes and turns the bytes back into a str.

Conversion principle
Every conversion between encodings goes through Unicode as the intermediate form.

UTF-8 to GB2312: first decode to Unicode, then encode to GB2312.
GB2312 to UTF-8: first decode to Unicode, then encode to UTF-8.

In practice (Python 3)

    import sys, time

    print('system default\t', sys.getdefaultencoding())  # system default encoding
    text = '庆余年很好看哈'  # a str is Unicode text (renamed from str to avoid shadowing the built-in)
    str_utf8 = text.encode('utf-8')
    str_gb2312 = str_utf8.decode('utf-8').encode('gb2312')  # convert via Unicode
    str_gbk = text.encode('gbk')
    print('unicode\t', text)
    print('utf-8\t', str_utf8)
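The two conversion directions described above can be wrapped in one small function and checked with a round trip. This is a minimal sketch; `transcode` is a hypothetical helper name, not something defined in the original post.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Hypothetical helper: decode to str (Unicode), then encode to the target encoding."""
    return data.decode(source).encode(target)

text = "庆余年很好看哈"
utf8_bytes = text.encode("utf-8")
gb2312_bytes = transcode(utf8_bytes, "utf-8", "gb2312")   # UTF-8 -> Unicode -> GB2312
round_trip = transcode(gb2312_bytes, "gb2312", "utf-8")   # GB2312 -> Unicode -> UTF-8

print(len(utf8_bytes), len(gb2312_bytes), round_trip == utf8_bytes)
```

The round trip only succeeds when every character exists in both encodings; a character missing from GB2312 would raise UnicodeEncodeError in the first step.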

Encoding and decoding in Python 2.7

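Common encodings
ASCII: invented in the United States; encodes only English letters and symbols, 1 byte each.
GB2312: invented in China; adds Chinese characters and symbols, 2 bytes each.
Unicode: created to unify all languages in a single character set; usually stored as 2 bytes, with rare characters taking 4 bytes (as in UTF-16).
UTF-8: a variable-length encoding designed to save space for English text. Common English letters take 1 byte, Chinese characters usually take 3 bytes, and rarer characters take 4 bytes (the original design allowed up to 6, but the current standard caps it at 4).

    >>> S = '中文'
    >>> print type(S), len(S)
    <type 'str'> 4

    >>> unicodeS = u'中文'
    >>> print type(unicodeS), len(unicodeS)
    <type 'unicode'> 2

    >>> utfS = u'中文'.encode('utf-8')
    >>> print type(utfS), len(utfS)
    <type 'str'> 6

In memory, text is handled uniformly as Unicode; when it needs to be saved to disk or transmitted, it is converted to UTF-8, which saves a lot of storage space.

The system default encodings of Python 2 and Python 3 are ascii and utf-8 respectively. Taking Python 2.7 as an example:

    >>> import sys
    >>> sys.getdefaultencoding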
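The same length comparison can be sketched in Python 3, where text (str) and encoded data (bytes) are separate types. The 4-byte result for the plain '中文' literal in the Python 2 session is consistent with a console encoding of 2 bytes per character such as GBK; that is an assumption, since the excerpt does not say which encoding the console used.

```python
text = "中文"                       # str: a sequence of Unicode code points
utf8_bytes = text.encode("utf-8")   # 3 bytes per character for these two
gbk_bytes = text.encode("gbk")      # 2 bytes per character (assumed console encoding)

print(type(text).__name__, len(text))
print(type(utf8_bytes).__name__, len(utf8_bytes))
print(type(gbk_bytes).__name__, len(gbk_bytes))
```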

Differences between the ANSI, Unicode, and UTF-8 encodings

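During localization, source and target files have to be transferred between systems, and this is where encoding becomes important. Chinese web pages and operating systems commonly use ANSI encoding, which is also a character standard of Microsoft operating systems. Different countries and regions defined their own ANSI standards, which produced encodings such as GB2312 (Simplified Chinese), BIG5 (Traditional Chinese), and JIS (Japanese). These ANSI encodings are not compatible across languages, so when files are transferred between different operating systems, or on the same operating system when the source file's language differs from the OS language, they need to be converted to UTF-8.

The specific differences:

ANSI: 16384 characters. This is the ANSI character standard. English characters take one byte, Chinese characters take two.

UNICODE: uses two bytes to encode almost all of the world's languages (0x0000 to 0xFFFF), 65536 characters. Each language has its own code range, and the character expressed by two bytes (English and Chinese are both two bytes) is unique, so different languages can coexist in one text, which solves the internationalization problem. (This describes the two-byte UCS-2 form; Unicode today extends beyond 0xFFFF.)

UTF8: a compressed form of Unicode. The English letter A is 0x0041 in Unicode; Western developers found this wasteful, since 50% of the space is wasted, so English was compressed to 1 byte, giving the UTF-8 encoding. Chinese characters, however, take 3 bytes in UTF-8, which is clearly less economical for Chinese than ANSI. This is why Chinese web pages use ANSI encodings while Western web pages commonly use UTF-8. On an English OS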
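The byte counts quoted above are easy to check. In the sketch below, gb2312 stands in for the Chinese "ANSI" code page and utf-16-be for the two-byte Unicode form; both choices are assumptions made for illustration.

```python
# Byte counts for an English and a Chinese character in the encodings discussed above.
samples = {"A": "English letter", "中": "Chinese character"}
encodings = ["gb2312", "utf-16-be", "utf-8"]

sizes = {
    (char, enc): len(char.encode(enc))
    for char in samples
    for enc in encodings
}

for (char, enc), size in sizes.items():
    print(f"{samples[char]} in {enc}: {size} byte(s)")
```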
