utf-8 | 易学教程

How to check if a string contains only UTF-8 characters

Question: So far I am doing something like this:

    def is_utf8(s):
        try:
            x=bytes(s,'utf-8').decode('utf-8', 'strict')
            print(x)
            return 1
        except:
            return 0

The only problem is that I don't want it to print anything. I want to delete the print(x), but when I do that, the function stops working correctly. For example, if I call print(is_utf8("H�tst")) while the print is in the function it returns 0; otherwise it prints 1. Am I approaching the problem in the wrong way?

Answer 1: You could use the chardet module to
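As a hedged illustration (this is not the chardet approach the answer starts to describe): in Python 3 a str is already decoded text, so a strict UTF-8 check is usually more meaningful on raw bytes. A likely explanation for the behavior in the question is that print(x) itself raised an encoding error on the console and the bare except swallowed it, but the excerpt does not confirm that. The sketch below assumes the input is bytes, which differs from the asker's signature.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if `data` is a valid strict UTF-8 byte sequence.

    Assumption: the caller has raw bytes (e.g. read from a file opened
    in binary mode), not an already-decoded str.
    """
    try:
        data.decode("utf-8", "strict")
    except UnicodeDecodeError:
        # Only decoding failures are caught, so unrelated errors
        # are not silently turned into a False result.
        return False
    return True
```

Catching UnicodeDecodeError instead of using a bare except keeps the result independent of side effects such as printing.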

Using unicode character u201c

Question: I'm new to Python and am having problems understanding Unicode. I'm using Python 3.4. I've spent an entire day trying to figure this out by reading about Unicode, including http://www.fileformat.info/info/unicode/char/201C/index.htm and http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html I need to refer to special quotes because they are used in the text I'm analyzing. I did test that the Windows 7 command window can read and write the 2 special quote characters. To
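The excerpt is cut off before the asker's code, so the following is only a sketch of one way to refer to U+201C and U+201D in Python 3 source. The constant and function names are hypothetical.

```python
import re

# Escape sequences keep the source file pure ASCII.
LEFT_DOUBLE_QUOTE = "\u201c"   # LEFT DOUBLE QUOTATION MARK
RIGHT_DOUBLE_QUOTE = "\u201d"  # RIGHT DOUBLE QUOTATION MARK


def extract_curly_quoted(text: str) -> list:
    """Return every substring enclosed in curly double quotes."""
    pattern = LEFT_DOUBLE_QUOTE + "(.*?)" + RIGHT_DOUBLE_QUOTE
    return re.findall(pattern, text)
```

When the text comes from a file, opening it with an explicit encoding, for example open(path, encoding="utf-8"), avoids depending on the platform default.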

Why does this Python program send empty emails when I encode it with utf-8? [duplicate]

Question: This question already has answers here: smtplib sends blank message if the message contain certain characters (3 answers). Closed 15 days ago.

Before encoding the msg variable, I was getting this error:

    UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 4: ordinal not in range(128)

So I did some research, and finally encoded the variable:

    msg = (os.path.splitext(base)[0] + ': ' + text).encode('utf-8')
    server.sendmail('...@gmail.com', '...@gmail.com', msg)

Here's the
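The excerpt ends before any answer, so this is not the accepted fix. It is a sketch of one common way to send non-ASCII text: build a structured message with the standard library's email.message.EmailMessage, so headers and body are kept separate and the body charset is declared. The helper name and the example addresses are assumptions.

```python
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a message whose body may contain non-ASCII characters."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # set_content() picks a charset (UTF-8 for non-ASCII text) and a
    # suitable transfer encoding; no manual .encode('utf-8') is needed.
    msg.set_content(body)
    return msg
```

A message built this way can be handed to smtplib.SMTP.send_message(msg) instead of passing a hand-assembled string to sendmail().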

Regular expression - PCRE (PHP) - word boundary (\b) and accent characters

Question: Why does the letter é count as a word boundary matching \b in the following example?

Pattern: /\b(cum)\b/i
Text: écumé

This matches 'cum', which is not desired. Is it possible to overcome this?

Answer 1: It will work when you add the u modifier to your regex: /\b(cum)\b/iu

Answer 2: To deal with Unicode, replace \b with lookarounds: /(?<=^|\PL)(cum)(?=\PL|$)/i

Source: https://stackoverflow.com/questions/22068702/regular-expression-pcre-php-word-boundary-b-and-accent-characters
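The same effect can be reproduced outside PHP. The sketch below is a Python analogue, not PCRE: a bytes pattern treats only ASCII letters as word characters, so the UTF-8 bytes of é act as a boundary, while a str pattern is Unicode-aware. The function names are hypothetical.

```python
import re


def matches_bytewise(text: str) -> bool:
    """Match against UTF-8 bytes: \\b only knows ASCII word characters."""
    return re.search(rb"\b(cum)\b", text.encode("utf-8"), re.IGNORECASE) is not None


def matches_unicode(text: str) -> bool:
    """Match against decoded text: é counts as a word character."""
    return re.search(r"\b(cum)\b", text, re.IGNORECASE) is not None
```

With "écumé" the byte-wise version finds 'cum' and the Unicode-aware version does not, which mirrors the difference between the pattern without and with the u modifier.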

How to convert Utf8 file to CP1252 by Unix

Question: I'm trying to convert a txt file's encoding from UTF-8 to ANSI (CP1252). I need this because the file is used in a fixed-position Oracle import (external table), which apparently only supports CP1252. If I import a UTF-8 file, some special characters turn up as two incorrect characters instead. I'm working on a Unix machine (my OS is HP-UX). I have been looking for an answer on the web, but I can't find any way to do this conversion. For example, the POSIX iconv command doesn't have this option,
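The excerpt stops before any answer. As one possible alternative to a shell tool, and assuming a Python 3 interpreter is available on the machine (the question does not say so), the conversion can be sketched as below. The function names are hypothetical.

```python
from pathlib import Path


def utf8_to_cp1252(data: bytes, errors: str = "replace") -> bytes:
    """Re-encode UTF-8 bytes as CP1252.

    With errors="replace", a character that CP1252 cannot represent
    becomes a single "?" byte, so every character still occupies
    exactly one byte in the output.
    """
    return data.decode("utf-8").encode("cp1252", errors=errors)


def convert_file(src: str, dst: str) -> None:
    """Read `src` as UTF-8 and write it to `dst` as CP1252."""
    Path(dst).write_bytes(utf8_to_cp1252(Path(src).read_bytes()))
```

Passing errors="strict" instead raises UnicodeEncodeError on the first unmappable character, which may be preferable when silent substitution is not acceptable.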

How do I avoid double UTF-8 encoding in XML::LibXML

Question: My program receives UTF-8 encoded strings from a data source. I need to tamper with these strings, then output them as part of an XML structure. When I serialize my XML document, it is double encoded and thus broken. When I serialize only the root element, it is fine, but of course lacks the header. Here's a piece of code that tries to illustrate the problem:

    use strict;
    use diagnostics;
    use feature 'unicode_strings';
    use utf8;
    use v5.14;
    use encoding::warnings;
    binmode(STDOUT, "
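The Perl code is truncated, so the sketch below does not reproduce it. It only illustrates, in Python, what double UTF-8 encoding generally looks like: bytes that are already UTF-8 get interpreted as one character per byte and are encoded a second time. The assumption that the misreading is Latin-1 is mine, and the function names are hypothetical.

```python
def double_encode(text: str) -> bytes:
    """Simulate double encoding: UTF-8 bytes misread as Latin-1 text."""
    once = text.encode("utf-8")
    misread = once.decode("latin-1")  # each byte becomes one character
    return misread.encode("utf-8")


def repair_double_encoding(data: bytes) -> str:
    """Undo the simulated double encoding (Latin-1 misreading assumed)."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")
```

The usual way to avoid the problem is to decode incoming bytes once at the input boundary, work with text internally, and encode once on output.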

Reading UTF-8 characters from console

Question: I'm trying to read UTF-8 encoded Polish characters from the console in my C++ application. I'm sure that the console uses this code page (checked in properties). What I have already tried:

- Using cin: instead of "zażółć" I read "za\0\0\0\0"
- Using wcin: instead of "zażółć", same result as with cin
- Using scanf: instead of 'zażółć\0' I read 'za\0\0\0\0\0'
- Using wscanf: same result as with scanf
- Using getchar to read characters one by one: same result as with scanf

At the beginning of the main
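The question concerns C++ and is cut off before any answer, so the following is only a language-neutral illustration written in Python: read raw bytes from the input stream and decode them explicitly as UTF-8, rather than relying on the runtime's default conversion. The function name is hypothetical, and the sketch does not address console code-page behavior.

```python
import io


def read_utf8_line(stream) -> str:
    """Read one line of raw bytes from a binary stream and decode it as UTF-8."""
    raw = stream.readline()  # bytes, including the trailing newline if present
    return raw.decode("utf-8").rstrip("\r\n")


# Demonstration with an in-memory stream standing in for the console.
demo = io.BytesIO("zażółć\n".encode("utf-8"))
line = read_utf8_line(demo)
```

For real console input, sys.stdin.buffer can be passed as the stream. Note that "zażółć" is 6 characters but 10 bytes in UTF-8, since four of its letters take two bytes each.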

UTF-8 - contradictory definitions

Question: My understanding of UTF-8 encoding is that the first byte of a UTF-8 char carries either:

- data in the lower 7 bits (0-6), with the high bit (7) clear, for single-byte ASCII-range code points
- data in the lower 5 bits (0-4), with high bits 7-5 = 110, to indicate a 2-byte char
- data in the lower 4 bits (0-3), with high bits 7-4 = 1110, to indicate a 3-byte char
- data in the lower 3 bits (0-2), with high bits 7-3 = 11110, to indicate a 4-byte char

noting that bit 7 is always set and this tells UTF-8 parsers
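The lead-byte patterns listed above can be sketched as a small classifier. This is a minimal illustration, not a full validator: it does not reject lead bytes that are never valid in UTF-8 (0xC0, 0xC1, 0xF5 and above), and the function name is hypothetical.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte."""
    if first_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx: ASCII
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte and cannot start a sequence.
    raise ValueError(f"not a UTF-8 lead byte: {first_byte:#04x}")
```

For any single character ch, utf8_sequence_length(ch.encode("utf-8")[0]) equals len(ch.encode("utf-8")).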

Python - How to convert utf literal such as '\xc3\xb6' to the letter ö

Question: I am trying to convert an encoded URL with German umlauts into a string with those umlauts. Here is an example of an encoded string: 'K%C3%B6nnen'. I would like to convert it to 'Können'. When I use urllib.unquote(a) I get this returned: 'K\xc3\xb6nnen'. I found out that \xc3\xb6 is a utf literal. How can I convert this to an ö? I find that if I use the print function it converts it correctly, but I cannot figure out how to get a function to return this value. Any ideas?

Answer 1: With decode("utf-8")
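The question uses Python 2's urllib.unquote, which returns a byte string. As a sketch of the equivalent in Python 3 (an assumption about the reader's environment, since the excerpt shows Python 2 usage), urllib.parse.unquote decodes the percent-escapes as UTF-8 directly, and raw bytes can be decoded explicitly as the answer suggests. The wrapper names are hypothetical.

```python
from urllib.parse import unquote


def decode_percent_encoded(value: str) -> str:
    """Turn a percent-encoded string into text, treating escapes as UTF-8."""
    return unquote(value, encoding="utf-8")


def decode_utf8_bytes(raw: bytes) -> str:
    """Decode raw UTF-8 bytes such as b'K\\xc3\\xb6nnen' into text."""
    return raw.decode("utf-8")
```

Both functions return a str, so the result can be returned from a function or compared directly without going through print.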