python-unicode

open() and codecs.open() in Python 2.7 behave strangely differently

旧巷老猫 submitted on 2019-11-29 10:31:16
I have a text file whose first line contains Unicode characters and whose other lines are all ASCII. I try to read the first line into one variable and all other lines into another. However, when I use the following code:

    # -*- coding: utf-8 -*-
    import codecs
    import os
    filename = '1.txt'
    f = codecs.open(filename, 'r3', encoding='utf-8')
    print f
    names_f = f.readline().split(' ')
    data_f = f.readlines()
    print len(names_f)
    print len(data_f)
    f.close()
    print 'And now for something completely differerent:'
    g = open(filename, 'r')
    names_g = g.readline().split(' ')
    print g
    data_g = g.readlines()
    print len(names_g)

How to write Russian characters to a file?

拈花ヽ惹草 submitted on 2019-11-29 09:58:19
When I try to output Russian characters in the console, I get ???????????????. Does anyone know why? I tried writing to a file, and the result is the same. For example:

    f=open('tets.txt','w')
    f.write('some russian text')
    f.close()

The file then contains ?????????????????????????/ or:

    p="some russian text"
    print p
    ?????????????

In addition, Notepad doesn't allow me to save a file with Russian letters. It shows this message: This file contains characters in Unicode format which will be lost if you save this file as an ANSI encoded text file. To keep the Unicode information, click Cancel below and then select one of the
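
One common way to avoid question marks in the file is to write a Unicode string through a file object that has an explicit encoding. The following is only a minimal sketch, not the accepted answer to this question: the sample text, the temporary directory, and the file name are assumptions made for illustration, and io.open is used because it accepts an encoding argument in both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Illustrative Russian text ("Privet"), written with escapes so the
# source file itself stays pure ASCII.
text = u"\u041f\u0440\u0438\u0432\u0435\u0442"

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# io.open encodes the Unicode string to UTF-8 bytes on write.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading with the same encoding gives the original string back.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

# The raw bytes on disk are UTF-8, two bytes per Cyrillic letter here.
with io.open(path, "rb") as f:
    raw = f.read()
```

The with block closes the file on exit, so no separate close call is needed. Whether the console can then display the text is a separate matter that depends on the terminal's own encoding.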

Python to show special characters

有些话、适合烂在心里 submitted on 2019-11-29 07:46:33
I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem. I am trying to print a string but when printed it doesn't show special characters (e.g. æ, ø, å, ö and ü). When I print the string using repr() this is what I get: u'Von D\xc3\xbc' and u'\xc3\x96berg' Does anyone know how I can convert this to Von Dü and Öberg ? It's important to me that these characters are not ignored, e.g. myStr.encode("ascii", "ignore") . EDIT This is the code I use. I use BeautifulSoup to scrape a website. The contents of a cell ( <td> ) in a table ( <table> )
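
The repr() output above looks like UTF-8 bytes that were decoded as Latin-1 somewhere along the way, so each accented letter became two characters. If that diagnosis is right (it is an assumption, since the scraping code is cut off here), the damage can be reversed by encoding back to Latin-1 and decoding as UTF-8. The helper name fix_mojibake is hypothetical.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up (assumed cause, see lead-in)."""
    # Latin-1 maps code points 0-255 one-to-one onto bytes, so this
    # recovers the original UTF-8 byte sequence, which is then decoded.
    return text.encode("latin-1").decode("utf-8")


# The two strings quoted in the question.
broken = [u"Von D\xc3\xbc", u"\xc3\x96berg"]
fixed = [fix_mojibake(s) for s in broken]
```

The cleaner fix is usually upstream: decode the downloaded bytes with the page's real encoding before handing them to the parser, so the mix-up never happens.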

Python string argument without an encoding

走远了吗. submitted on 2019-11-29 02:48:54
I am trying to run this piece of code, and it keeps giving an error saying "String argument without an encoding":

    ota_packet = ota_packet.encode('utf-8') + bytearray(content[current_pos:(final_pos)]) + '\0'.encode('utf-8')

Any help?

You are passing a string object to bytearray():

    bytearray(content[current_pos:(final_pos)])

You'll need to supply an encoding argument (the second argument) so that it can be encoded to bytes. For example, you could encode it to UTF-8:

    bytearray(content[current_pos:(final_pos)], 'utf8')

From the bytearray() documentation: The optional source parameter can be used
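
A small sketch of the failure and the fix described in the answer above, assuming Python 3. The values of content, ota_packet, current_pos, and final_pos are made up for illustration; the question does not show them.

```python
# Hypothetical stand-ins for the variables in the question.
content = "firmware-chunk"
ota_packet = "HDR"
current_pos, final_pos = 0, 8

# Without an encoding, bytearray() refuses a str in Python 3.
try:
    bytearray(content[current_pos:final_pos])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# With the encoding supplied, the slice is encoded to bytes first.
packet = (
    ota_packet.encode("utf-8")
    + bytearray(content[current_pos:final_pos], "utf-8")
    + "\0".encode("utf-8")
)
```

An equivalent spelling is content[current_pos:final_pos].encode("utf-8"), which skips the bytearray step entirely when a mutable buffer is not needed.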

python url unquote followed by unicode decode

雨燕双飞 submitted on 2019-11-29 01:23:18
Question: I have a Unicode string like '%C3%A7%C3%B6asd+fjkls%25asd' and I want to decode it. I used urllib.unquote_plus(str), but it gives the wrong result.

    expected: çöasd+fjkls%asd
    result:   çöasd fjkls%asd

The double-encoded UTF-8 characters (%C3%A7 and %C3%B6) are decoded wrongly. I am using Python 2.7 on a Linux distribution. What is the best way to get the expected result?

Answer 1: You have 3 or 4 or 5 problems ... but repr() and unicodedata.name() are your friends; they unambiguously show you exactly what you
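
One visible difference between the expected and actual output is the plus sign: unquote_plus turns '+' into a space, while unquote leaves it alone. The sketch below illustrates that difference using Python 3's urllib.parse, which is an assumption; the question itself is about Python 2.7, where the functions live in urllib and return byte strings for byte-string input, so a separate UTF-8 decode step would be needed there.

```python
from urllib.parse import unquote, unquote_plus

raw = "%C3%A7%C3%B6asd+fjkls%25asd"

# unquote decodes percent-escapes (as UTF-8 by default) and keeps '+'.
kept_plus = unquote(raw)

# unquote_plus additionally replaces '+' with a space, as in form data.
plus_as_space = unquote_plus(raw)
```

Here kept_plus matches the string the question lists as expected, and plus_as_space matches the one it lists as the result.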

Google App Engine: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 48: ordinal not in range(128)

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-29 01:02:31
I'm working on a small application using Google App Engine which makes use of the Quora RSS feed. There is a form, and based on the input entered by the user, it outputs a list of links related to the input. The application works fine for one-letter queries and for most two-letter words if the words are separated by a '-'. However, for three-letter words and some two-letter words, I get the following error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 48: ordinal not in range(128)

Here's my Python code:

    import os
    import webapp2
    import jinja2
    from google

Python 3: os.walk() file paths UnicodeEncodeError: 'utf-8' codec can't encode: surrogates not allowed

只谈情不闲聊 submitted on 2019-11-28 21:26:32
This code:

    for root, dirs, files in os.walk('.'):
        print(root)

gives me this error:

    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 27: surrogates not allowed

How do I walk through a file tree without getting toxic strings like this?

On Linux, filenames are 'just a bunch of bytes', and are not necessarily encoded in a particular encoding. Python 3 tries to turn everything into Unicode strings. In doing so the developers came up with a scheme to translate byte strings to Unicode strings and back without loss, and without knowing the original encoding. They used
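
The answer is cut off here, but the lossless scheme it describes appears to be the surrogateescape error handler, under which each undecodable byte becomes a lone surrogate code point such as '\udcc3'. The sketch below demonstrates the round trip on a made-up filename (raw_name is hypothetical) and one possible way to get a printable string; it assumes Python 3 and is not necessarily the remedy the original answer goes on to recommend.

```python
# Hypothetical filename whose last byte is an incomplete UTF-8 sequence.
raw_name = b"caf\xc3"

# Decoding with surrogateescape maps the bad byte 0xC3 to U+DCC3,
# the same kind of character the error message complains about.
name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")

# For display only: go back to bytes, then decode with 'replace' so the
# bad byte becomes U+FFFD and the result can be printed as UTF-8.
printable = restored.decode("utf-8", "replace")
```

The printable form is lossy, so it suits log output; the surrogate-carrying name is the one to keep for opening or renaming the file.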

Printing a unicode string to the console works but fails when redirected to a file. How to fix?

杀马特。学长 韩版系。学妹 submitted on 2019-11-28 11:05:00
Question: I have Python 2.7.1 on a Simplified Chinese version of Windows XP, and I have a program like this (windows_prn_utf8.py):

    #!/usr/bin/env python
    # -*- coding: utf8 -*-
    print unicode('\xE7\x94\xB5', 'utf8')

If I run it in the Windows CMD console, it outputs the right Chinese character '电'; however, if I try to redirect the command output to a file, I get an error.

    D:\Temp>windows_prn_utf8.py > 1.txt
    Traceback (most recent call last):
      File "D:\Temp\windows_prn_utf8.py", line 4, in <module>
        print unicode(

Python Latin Characters and Unicode

ぐ巨炮叔叔 submitted on 2019-11-28 10:25:26
Question: I have a tree structure in which keywords may contain some Latin characters. I have a function which loops through all leaves of the tree and adds each keyword to a list under certain conditions. Here is the code I have for adding these keywords to the list:

    print "Adding: " + self.keyword
    leaf_list.append(self.keyword)
    print leaf_list

If the keyword in this case is université, then my output is:

    Adding: université
    ['universit\xc3\xa9']

It appears that the print function properly shows the

Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte

早过忘川 submitted on 2019-11-28 09:14:30
Question: I'm running a large number of OCRs on screenshots with Pytesseract. This works well in most cases, but a small number of them cause this error:

    pytesseract.image_to_string(image,None, False, "-psm 6")
    Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>

I'm using Python 3.4. Any suggestions on how I can prevent this error from happening (other than just a try/except) would be very helpful.

Answer 1: Use Unidecode from unidecode import