python-unicode | 易学教程

Open() and codecs.open() in Python 2.7 behave strangely different

I have a text file with first line of unicode characters and all other lines in ASCII. I try to read the first line as one variable, and all other lines as another. However, when I use the following code: # -*- coding: utf-8 -*- import codecs import os filename = '1.txt' f = codecs.open(filename, 'r3', encoding='utf-8') print f names_f = f.readline().split(' ') data_f = f.readlines() print len(names_f) print len(data_f) f.close() print 'And now for something completely differerent:' g = open(filename, 'r') names_g = g.readline().split(' ') print g data_g = g.readlines() print len(names_g)
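One way to read a non-ASCII header line and the remaining ASCII lines consistently is to let the file object do the decoding. The sketch below is a minimal illustration, assuming Python 3 (io.open behaves the same way on 2.7) and a hypothetical sample file created on the fly; it is not the asker's data.

```python
import io
import os
import tempfile

# Hypothetical sample file: a non-ASCII header line followed by ASCII data lines.
path = os.path.join(tempfile.mkdtemp(), "1.txt")
with io.open(path, "w", encoding="utf-8") as out:
    out.write(u"\u0438\u043c\u044f \u0433\u043e\u0440\u043e\u0434\n1 2\n3 4\n")

# io.open decodes while reading, so the header and the data lines
# all come back as text decoded with the same codec.
with io.open(path, "r", encoding="utf-8") as f:
    names = f.readline().split()
    data = f.readlines()

print(len(names), len(data))
```

Because the header and the data are decoded by the same reader, readline() and readlines() stay in step with each other.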

How to write Russian characters in file?

When I try to output Russian characters in the console, I get ???????????????. Does anyone know why? I also tried writing to a file, with the same result. For example: f=open('tets.txt','w') f.write('some russian text') f.close() leaves ?????????????????????????/ inside the file, and p="some russian text" print p prints ?????????????. In addition, Notepad does not let me save a file with Russian letters. It shows this message: This file contains characters in Unicode format which will be lost if you save this file as an ANSI encoded text file. To keep the Unicode information, click Cancel below and then select one of the
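Question marks in a file usually mean the text was pushed through a codec that cannot represent Cyrillic. A minimal sketch of writing with an explicit encoding follows, assuming Python 3 and a hypothetical file name (russian.txt); the sample string is illustrative only.

```python
import io
import os
import tempfile

# Illustrative Cyrillic text, written with escapes so the source stays ASCII.
text = u"\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440"

# Hypothetical output path; any writable location would do.
path = os.path.join(tempfile.mkdtemp(), "russian.txt")

# Naming the encoding makes the result independent of the console
# or system default code page.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(round_trip == text)
```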

Python to show special characters

I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem. I am trying to print a string but when printed it doesn't show special characters (e.g. æ, ø, å, ö and ü). When I print the string using repr() this is what I get: u'Von D\xc3\xbc' and u'\xc3\x96berg' Does anyone know how I can convert this to Von Dü and Öberg ? It's important to me that these characters are not ignored, e.g. myStr.encode("ascii", "ignore") . EDIT This is the code I use. I use BeautifulSoup to scrape a website. The contents of a cell ( <td> ) in a table ( <table> )
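The repr() output above shows UTF-8 bytes that were decoded as if each byte were a Latin-1 character. One common repair, sketched here under the assumption that this is indeed what happened to the scraped text, is to map the characters back to bytes and decode them as UTF-8. The helper name fix_mojibake is hypothetical.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up (hypothetical helper)."""
    # latin-1 maps code points 0-255 straight back to the original bytes,
    # which can then be decoded with the codec they were written in.
    return text.encode("latin-1").decode("utf-8")


samples = [u"Von D\xc3\xbc", u"\xc3\x96berg"]
fixed = [fix_mojibake(s) for s in samples]

print([ascii(s) for s in fixed])
```

This only works when every character in the string is below U+0100; text that was decoded correctly in the first place would raise an error or be damaged by the same call.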

Python string argument without an encoding

I am trying to run this piece of code, and it keeps giving an error saying "String argument without an encoding": ota_packet = ota_packet.encode('utf-8') + bytearray(content[current_pos:(final_pos)]) + '\0'.encode('utf-8') Any help? You are passing a string object to bytearray(): bytearray(content[current_pos:(final_pos)]) You'll need to supply an encoding argument (second argument) so that it can be encoded to bytes. For example, you could encode it to UTF-8: bytearray(content[current_pos:(final_pos)], 'utf8') From the bytearray() documentation: The optional source parameter can be used
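The behaviour described in the answer can be reproduced in a few lines. This is a sketch assuming Python 3; the values of header, content, current_pos and final_pos are made up for illustration.

```python
# Hypothetical stand-ins for the asker's variables.
header = "OTA"
content = "payload"
current_pos, final_pos = 0, 4

# Without an encoding, bytearray() refuses a str argument.
try:
    bytearray(content[current_pos:final_pos])
    raised = False
except TypeError:
    raised = True

# With an encoding, the slice is converted to bytes and can be concatenated.
packet = (
    header.encode("utf-8")
    + bytearray(content[current_pos:final_pos], "utf-8")
    + "\0".encode("utf-8")
)

print(raised, bytes(packet))
```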

python url unquote followed by unicode decode

Question: I have a unicode string like '%C3%A7%C3%B6asd+fjkls%25asd' and I want to decode this string. I used urllib.unquote_plus(str) but it gives the wrong result. expected : çöasd+fjkls%asd result : Ã§Ã¶asd fjkls%asd The percent-encoded UTF-8 characters ( %C3%A7 and %C3%B6 ) are decoded wrongly. My Python version is 2.7 under a Linux distro. What is the best way to get the expected result? Answer 1: You have 3 or 4 or 5 problems ... but repr() and unicodedata.name() are your friends; they unambiguously show you exactly what you
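For comparison, here is how the same input behaves with Python 3's urllib.parse, which decodes percent-escapes as UTF-8 by default. This is a sketch for a different Python version than the 2.7 in the question, so it illustrates the intended result rather than the fix for that setup.

```python
from urllib.parse import unquote, unquote_plus

quoted = "%C3%A7%C3%B6asd+fjkls%25asd"

# unquote() leaves '+' alone; unquote_plus() also turns '+' into a space.
kept_plus = unquote(quoted)
plus_as_space = unquote_plus(quoted)

print(ascii(kept_plus))
print(ascii(plus_as_space))
```

Note that the "expected" string in the question keeps the plus sign, which matches unquote() rather than unquote_plus().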

Google App Engine: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 48: ordinal not in range(128)

I'm working on a small application using Google App Engine which makes use of the Quora RSS feed. There is a form, and based on the input entered by the user, it will output a list of links related to the input. Now, the application works fine for one-letter queries and most two-letter words if the words are separated by a '-'. However, for three-letter words and some two-letter words, I get the following error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 48: ordinal not in range(128) Here's my Python code: import os import webapp2 import jinja2 from google
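Byte 0xe2 is the first byte of many UTF-8 encoded punctuation marks, such as a curly apostrophe, so this error typically appears when UTF-8 bytes are decoded with the ASCII codec. The sketch below reproduces the failure in isolation, assuming Python 3 and an invented sample string; it does not use App Engine or the real feed.

```python
# Invented sample: text with a curly apostrophe, encoded as UTF-8 bytes.
raw = u"Quora\u2019s feed".encode("utf-8")

failed_byte = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte the codec rejected.
    failed_byte = raw[exc.start:exc.start + 1]

# Decoding with the codec the bytes were written in succeeds.
text = raw.decode("utf-8")

print(failed_byte, ascii(text))
```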

Python 3: os.walk() file paths UnicodeEncodeError: 'utf-8' codec can't encode: surrogates not allowed

This code: for root, dirs, files in os.walk('.'): print(root) Gives me this error: UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 27: surrogates not allowed How do I walk through a file tree without getting toxic strings like this? On Linux, filenames are 'just a bunch of bytes', and are not necessarily encoded in a particular encoding. Python 3 tries to turn everything into Unicode strings. In doing so the developers came up with a scheme to translate byte strings to Unicode strings and back without loss, and without knowing the original encoding. They used
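The lossless scheme Python 3 uses for undecodable filename bytes is the surrogateescape error handler (PEP 383). The following sketch shows the round trip on a made-up byte string rather than a real directory tree, assuming UTF-8 as the filesystem encoding.

```python
# Made-up filename bytes that are not valid UTF-8 (a lone 0xc3).
raw_name = b"caf\xc3"

# surrogateescape smuggles the bad byte through as the lone surrogate U+DCC3.
name = raw_name.decode("utf-8", "surrogateescape")

# A strict encode, which is what print() attempts, rejects the surrogate.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")

# For display, the offending character can be replaced instead.
printable = name.encode("utf-8", "replace").decode("utf-8")

print(ascii(name), strict_failed, restored, printable)
```

The original bytes stay recoverable for file operations, while the replaced form is only meant for logging or printing.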

Print unicode string to console OK but fails when redirect to a file. How to fix?

Question: I have Python 2.7.1 on a Simplified-Chinese version of Windows XP, and I have a program like this (windows_prn_utf8.py): #!/usr/bin/env python # -*- coding: utf8 -*- print unicode('\xE7\x94\xB5', 'utf8') If I run it in the Windows CMD console, it outputs the right Chinese character '电'; however, if I try to redirect the command output to a file, I get an error. D:\Temp>windows_prn_utf8.py > 1.txt Traceback (most recent call last): File "D:\Temp\windows_prn_utf8.py", line 4, in <module> print unicode(
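When output is redirected, the stream may have no usable encoding, so one option is to encode explicitly on the way out. The sketch below wraps a byte stream in a UTF-8 writer; an in-memory io.BytesIO stands in for the redirected stdout, which is an assumption made to keep the example self-contained.

```python
import codecs
import io

# Stand-in for a redirected, byte-oriented stdout.
buffer = io.BytesIO()

# The writer encodes every text string to UTF-8 before it reaches the stream.
writer = codecs.getwriter("utf-8")(buffer)
writer.write(u"\u7535\n")

written = buffer.getvalue()
print(written)
```

The bytes written are the same three bytes ('\xE7\x94\xB5') that the program in the question decodes from.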

Python Latin Characters and Unicode

Question: I have a tree structure in which keywords may contain some Latin characters. I have a function which loops through all leaves of the tree and adds each keyword to a list under certain conditions. Here is the code I have for adding these keywords to the list: print "Adding: " + self.keyword leaf_list.append(self.keyword) print leaf_list If the keyword in this case is université , then my output is: Adding: université ['universit\xc3\xa9'] It appears that the print function properly shows the
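The difference in the two outputs comes from how a list is displayed: printing a list shows the repr() of each element, which escapes non-ASCII bytes, while printing the element itself shows its text. A rough Python 3 analogue of the Python 2 situation above, using a byte string to play the role of the Python 2 str:

```python
# UTF-8 bytes for the keyword, standing in for a Python 2 byte string.
keyword = u"universit\u00e9".encode("utf-8")
leaf_list = [keyword]

# Displaying the list uses repr() on each element, so the bytes are escaped.
as_list = repr(leaf_list)

# Decoding the element yields readable text again; the data was never damaged.
as_text = keyword.decode("utf-8")

print(as_list)
print(ascii(as_text))
```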

Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte

Question: I'm running a large number of OCRs on screenshots with Pytesseract. This is working well in most cases, but a small number of them cause this error: pytesseract.image_to_string(image,None, False, "-psm 6") Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined> I'm using Python 3.4. Any suggestions on how I can prevent this error from happening (other than just a try/except) would be very helpful. Answer 1: Use Unidecode from unidecode import
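The 'charmap' error itself can be reproduced without any OCR. A plausible cause, offered here as an assumption rather than a diagnosis of Pytesseract, is UTF-8 output being decoded with a Windows code page such as cp1252, in which byte 0x9d is undefined. The sketch uses a right double quotation mark, whose UTF-8 encoding ends in 0x9d.

```python
# UTF-8 bytes for a right double quotation mark: e2 80 9d.
raw = u"\u201d".encode("utf-8")

position = None
try:
    # cp1252 has no character assigned to byte 0x9d.
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    position = exc.start

# Decoding with the codec the bytes were produced in works.
text = raw.decode("utf-8")

print(position, ascii(text))
```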