utf-8

How to check if a string contains only UTF-8 characters

喜夏-厌秋 submitted on 2020-08-07 08:15:08
Question: So far I am doing something like this:
    def is_utf8(s):
        try:
            x=bytes(s,'utf-8').decode('utf-8', 'strict')
            print(x)
            return 1
        except:
            return 0
The only problem is that I don't want it to print anything. I want to delete the print(x), but when I do that, the function stops working correctly. For example, if I call print(is_utf8("H�tst")) while the print is in the function, it returns 0; otherwise it prints 1. Am I approaching the problem in the wrong way?
Answer 1: You could use the chardet module to
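A minimal sketch of the check the question is after, assuming Python 3 and assuming the input is available as raw bytes (a Python 3 str is already decoded text, so round-tripping it through UTF-8 says little about the original data). The name is_utf8 comes from the question; the bytes parameter and the bool return value are assumptions made for illustration.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the byte string is valid UTF-8, False otherwise."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


valid = is_utf8("H\u00fctst".encode("utf-8"))  # UTF-8 encoded text
invalid = is_utf8(b"H\xfctst")                 # 0xFC is not a valid UTF-8 lead byte
```

Catching only UnicodeDecodeError, rather than using a bare except, keeps unrelated failures visible. In the question's version, the bare except may also be swallowing an error raised by print itself on a console that cannot display the character, which would explain why removing the print changes the result.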

Using unicode character u201c

不想你离开。 submitted on 2020-08-07 04:45:07
Question: I'm new to Python and am having problems understanding Unicode. I'm using Python 3.4. I've spent an entire day trying to figure this out by reading about Unicode, including http://www.fileformat.info/info/unicode/char/201C/index.htm and http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html I need to refer to special quotes because they are used in the text I'm analyzing. I did test that the W7 command window can read and write the 2 special quote characters. To
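As a small illustration of referring to these characters in Python 3 source, the sketch below uses \u escapes. It assumes the two special quotes are U+201C and U+201D (only U+201C is named in the question); the constant and function names are hypothetical.

```python
import unicodedata

# Assumption: the "2 special quote characters" are the curly double quotes.
LEFT_QUOTE = "\u201c"
RIGHT_QUOTE = "\u201d"


def count_curly_quotes(text: str) -> int:
    """Count curly double quotes in already-decoded text."""
    return text.count(LEFT_QUOTE) + text.count(RIGHT_QUOTE)


sample = "\u201cquoted\u201d text"
quote_count = count_curly_quotes(sample)
left_name = unicodedata.name(LEFT_QUOTE)
```

Using escapes keeps the source file pure ASCII, so the result does not depend on how the editor or console encodes the literal characters.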

Why does this Python program send empty emails when I encode it with utf-8? [duplicate]

北慕城南 submitted on 2020-08-06 07:22:14
Question: This question already has answers here: smtplib sends blank message if the message contain certain characters (3 answers). Closed 15 days ago.
Before encoding the msg variable, I was getting this error:
    UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 4: ordinal not in range(128)
So I did some research, and finally encoded the variable:
    msg = (os.path.splitext(base)[0] + ': ' + text).encode('utf-8')
    server.sendmail('...@gmail.com', '...@gmail.com', msg)
Here's the
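One commonly used alternative to encoding the string by hand is to let the standard library's email package build the message, so that headers and body are separated and the charset is declared. The sketch below is not taken from the linked answers; the function name and the example.com addresses are hypothetical placeholders.

```python
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a plain-text message; the email package picks a charset for non-ASCII text."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)  # declares charset utf-8 for text that is not pure ASCII
    return msg


message = build_message(
    "sender@example.com",      # hypothetical address
    "recipient@example.com",   # hypothetical address
    "report",
    "report: f\u00fcr",
)
# Sending would then use smtplib's send_message, e.g. server.send_message(message),
# where server is an already connected smtplib.SMTP instance (not created here).
```

With an explicit blank line between headers and body, text such as "name: text" can no longer be mistaken for a header line, which is one plausible reason a hand-built message arrives empty.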

Regular expression - PCRE (PHP) - word boundary (\b) and accent characters

懵懂的女人 submitted on 2020-07-31 03:55:05
Question: Why does the letter é count as a word boundary matching \b in the following example? Pattern: /\b(cum)\b/i Text: écumé This matches 'cum', which is not desired. Is it possible to overcome this?
Answer 1: It works when you add the u modifier to your regex: /\b(cum)\b/iu
Answer 2: To deal with Unicode, replace \b with lookarounds: /(?<=^|\PL)(cum)(?=\PL|$)/i
Source: https://stackoverflow.com/questions/22068702/regular-expression-pcre-php-word-boundary-b-and-accent-characters
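The same effect can be reproduced outside PHP. The sketch below is a Python analogue, not the PCRE code from the answers: Python's re module is Unicode-aware for str patterns by default, and the re.ASCII flag is used here to imitate a byte-oriented \b that does not treat é as a word character.

```python
import re

TEXT = "\u00e9cum\u00e9"  # "écumé", written with escapes to avoid encoding surprises
PATTERN = r"\b(cum)\b"

# ASCII-only word characters: é is not a word character, so \b matches next to it.
ascii_match = re.search(PATTERN, TEXT, re.IGNORECASE | re.ASCII)

# Unicode-aware word characters (the default for str): é is a letter, no boundary.
unicode_match = re.search(PATTERN, TEXT, re.IGNORECASE)
```

Here ascii_match finds 'cum' while unicode_match is None, which mirrors the difference the u modifier makes in the first answer.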

How to convert a UTF-8 file to CP1252 on Unix

南楼画角 submitted on 2020-07-22 12:47:05
Question: I'm trying to convert a text file's encoding from UTF-8 to ANSI (CP1252). I need this because the file is used in a fixed-position Oracle import (external table), which apparently only supports CP1252. If I import a UTF-8 file, some special characters turn up as two incorrect characters instead. I'm working on a Unix machine (my OS is HP-UX). I have been looking for an answer on the web, but I can't find any way to do this conversion. For example, the POSIX iconv command doesn't have this option,
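Where a Python 3 interpreter is available on the machine, the conversion can also be sketched without iconv. This is an assumption-laden illustration, not the answer from the original thread; the function name and its parameters are hypothetical.

```python
def utf8_to_cp1252(src_path: str, dst_path: str, errors: str = "strict") -> None:
    """Re-encode a UTF-8 text file as CP1252, line by line.

    With errors="strict", a character that has no CP1252 equivalent raises
    UnicodeEncodeError; errors="replace" would substitute "?" instead.
    """
    # newline="" leaves line endings untouched, which matters for fixed-position files.
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        with open(dst_path, "w", encoding="cp1252", errors=errors, newline="") as dst:
            for line in src:
                dst.write(line)
```

Note that a multi-byte UTF-8 character becomes a single byte in CP1252, so fixed-position columns only line up after the conversion, not before it.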

How do I avoid double UTF-8 encoding in XML::LibXML

雨燕双飞 submitted on 2020-07-21 07:39:07
Question: My program receives UTF-8 encoded strings from a data source. I need to tamper with these strings, then output them as part of an XML structure. When I serialize my XML document, it is double-encoded and thus broken. When I serialize only the root element, it is fine, but of course it lacks the header. Here's a piece of code trying to visualize the problem:
    use strict;
    use diagnostics;
    use feature 'unicode_strings';
    use utf8;
    use v5.14;
    use encoding::warnings;
    binmode(STDOUT, "
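To make the term concrete, the sketch below shows what double UTF-8 encoding looks like at the byte level. It is written in Python rather than Perl and does not use XML::LibXML; it assumes the second encoding step treats the already-encoded bytes as Latin-1 characters, which is the usual way this kind of corruption arises.

```python
text = "\u00f6"  # "ö"

# Correct: encode the character once.
encoded_once = text.encode("utf-8")  # two bytes: C3 B6

# Double encoding: the bytes are misread as Latin-1 characters, then encoded again.
encoded_twice = encoded_once.decode("latin-1").encode("utf-8")  # four bytes: C3 83 C2 B6

# Undoing the extra layer recovers the original text.
repaired = encoded_twice.decode("utf-8").encode("latin-1").decode("utf-8")
```

The usual remedy is to decode incoming bytes exactly once at the boundary and to let exactly one layer (either the serializer or the output handle) do the encoding.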

Reading UTF-8 characters from console

十年热恋 submitted on 2020-07-18 05:05:48
Question: I'm trying to read UTF-8 encoded Polish characters from the console in my C++ application. I'm sure that the console uses this code page (checked in properties). What I have already tried:
- Using cin: instead of "zażółć" I read "za\0\0\0\0"
- Using wcin: same result as with cin
- Using scanf: instead of 'zażółć\0' I read 'za\0\0\0\0\0'
- Using wscanf: same result as with scanf
- Using getchar to read characters one by one: same result as with scanf
At the beginning of the main

UTF-8 - contradictory definitions

自作多情 submitted on 2020-07-10 10:25:26
Question: My understanding of UTF-8 encoding is that the first byte of a UTF-8 char carries either:
- data in the lower 7 bits (0-6), with the high bit (7) clear, for single-byte ASCII-range code points
- data in the lower 5 bits (0-4), with high bits 7-5 = 110, to indicate a 2-byte char
- data in the lower 4 bits (0-3), with high bits 7-4 = 1110, to indicate a 3-byte char
- data in the lower 3 bits (0-2), with high bits 7-3 = 11110, to indicate a 4-byte char
noting that bit 7 is always set and this tells utf-8 parsers
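The lead-byte patterns listed above can be sketched as a small classifier. The function name is hypothetical, and the check looks only at the bit pattern of the first byte; it does not validate continuation bytes or reject lead bytes that are never legal in UTF-8.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the sequence length implied by the bit pattern of a UTF-8 lead byte."""
    if (lead_byte & 0x80) == 0x00:  # 0xxxxxxx: single-byte ASCII range
        return 1
    if (lead_byte & 0xE0) == 0xC0:  # 110xxxxx: 2-byte sequence, 5 data bits
        return 2
    if (lead_byte & 0xF0) == 0xE0:  # 1110xxxx: 3-byte sequence, 4 data bits
        return 3
    if (lead_byte & 0xF8) == 0xF0:  # 11110xxx: 4-byte sequence, 3 data bits
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a lead-byte pattern.
    raise ValueError(f"not a UTF-8 lead byte: {lead_byte:#04x}")


lengths = [utf8_sequence_length(ch.encode("utf-8")[0]) for ch in ("A", "\u00f6", "\u20ac", "\U0001F600")]
```

For the four sample characters (A, ö, €, and an emoji), the classifier reports lengths 1, 2, 3 and 4, matching the number of bytes each one occupies when encoded.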

Python - How to convert utf literal such as '\xc3\xb6' to the letter ö

天大地大妈咪最大 submitted on 2020-07-10 09:35:03
Question: I am trying to convert an encoded URL with German umlauts into a string with those umlauts. Here is an example of an encoded string: 'K%C3%B6nnen'. I would like to convert it to 'Können'. When I use urllib.unquote(a), I get this returned: 'K\xc3\xb6nnen'. I found out that \xc3\xb6 is a utf literal. How can I convert this to an ö? I find that if I use the print function it converts it correctly, but I cannot figure out how to get a function to return this value. Any ideas?
Answer 1: With decode("utf-8")
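A minimal sketch of both steps, assuming Python 3 (the question's urllib.unquote is the Python 2 spelling; in Python 3 the function lives in urllib.parse). The helper names are hypothetical.

```python
from urllib.parse import unquote


def decode_url_component(encoded: str) -> str:
    """Percent-decode a URL component, interpreting the bytes as UTF-8."""
    return unquote(encoded, encoding="utf-8")


def decode_utf8_bytes(raw: bytes) -> str:
    """Turn UTF-8 bytes such as b'K\\xc3\\xb6nnen' into text."""
    return raw.decode("utf-8")


from_url = decode_url_component("K%C3%B6nnen")
from_bytes = decode_utf8_bytes(b"K\xc3\xb6nnen")
```

Both helpers return the decoded text as a str, so the value can be passed on or compared directly rather than only printed.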