character-encoding

Parse XML in Python with encoding other than utf-8

拥有回忆 submitted on 2021-02-10 06:42:07
Question: Any clue on how to parse XML in Python that has encoding='Windows-1255' in it? At least the lxml.etree parser won't even look at the string when the XML header carries an "encoding" declaration that isn't "utf-8" or "ASCII". Running the following code fails with: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
    from lxml import etree
    parser = etree.XMLParser(encoding='utf-8')
    def convert_xml_to_utf8(xml_str):
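
The error message quoted above already hints at the usual remedy: give the parser bytes rather than an already-decoded string, so that it can honor the declared encoding itself. A minimal sketch of that idea follows. It uses the standard library's xml.etree.ElementTree instead of lxml to stay dependency-free, and both the sample document and the helper name parse_declared_xml are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_declared_xml(raw: bytes) -> ET.Element:
    """Parse XML supplied as bytes; the parser reads the declared encoding."""
    return ET.fromstring(raw)


# Hypothetical sample: a short Hebrew word stored as Windows-1255 bytes.
sample = (
    '<?xml version="1.0" encoding="windows-1255"?>'
    "<greeting>\u05e9\u05dc\u05d5\u05dd</greeting>"
).encode("cp1255")

root = parse_declared_xml(sample)
# ascii() keeps the output printable on any terminal encoding.
print(root.tag, ascii(root.text))
```

With lxml the same bytes-in approach should apply, since etree.fromstring accepts bytes, but that variant is not exercised here.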

How to open Arabic text file with correct encoding in Visual Studio

霸气de小男生 submitted on 2021-02-10 06:08:12
Question: I have a C# file that has some Arabic text in it. I got the file from another source, and the Arabic text is now scrambled, looking like this: ("ÇáãæÇÞÚ ÇáÞÇÈáÉ ááÊØæíÑ ÇáÓíÇÍì"). I tried to save the file in another encoding (UTF-8) but got the same result. I desperately need to read this Arabic text, as this is the only backup we have. Thanks.
Answer 1: Try right-clicking the file in VS Solution Explorer, then choose: Open With... -> CSharp Editor with Encoding. This should force VS to read the file with a
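
The scrambled sample looks like what appears when Windows-1256 (Arabic) bytes are displayed as Windows-1252. That is an assumption read off the character pattern, not something the question states. Under that assumption, the damage can be undone outside the editor by recovering the original bytes and decoding them again; repair_mojibake is a hypothetical helper name.

```python
def repair_mojibake(garbled: str, wrong: str = "cp1252", right: str = "cp1256") -> str:
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    return garbled.encode(wrong).decode(right)


# The scrambled string quoted in the question.
garbled = "ÇáãæÇÞÚ ÇáÞÇÈáÉ ááÊØæíÑ ÇáÓíÇÍì"
fixed = repair_mojibake(garbled)

# Write the repaired text out as UTF-8 bytes if it needs to be saved, e.g.:
# open("repaired.cs", "wb").write(fixed.encode("utf-8"))
```

This only works while the file still holds the original bytes. If it was re-saved in a way that replaced characters with "?", the information is gone.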

How to identify character encoding from website?

ぐ巨炮叔叔 submitted on 2021-02-09 11:14:06
Question: What I'm trying to do: I get a list of URIs from a database and download them, remove the stopwords, count how often each word appears in the webpage, and then try to save the result in MongoDB. The problem: when I try to save the result in the database I get the error bson.errors.InvalidDocument: the document must be a valid utf-8. It appears to be related to byte sequences such as '\xc3someotherstrangewords' and '\xe2something'. When I'm processing the webpages I try to remove the punctuation,

Is it possible to “sniff” the Character encoding?

我的梦境 submitted on 2021-02-08 14:53:33
Question: I have a webpage that accepts CSV files. These files may be created in a variety of places. (I think) there is no way to specify the encoding in a CSV file, so I cannot reliably treat all of them as UTF-8 or any other encoding. Is there a way to intelligently guess the encoding of the CSV I am getting? I am working with Python, but I am willing to work with language-agnostic methods too.
Answer 1: There is no correct way to determine the encoding of a file by looking at only the file itself, but you
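
One common heuristic, sketched here under the assumption that uploads are mostly UTF-8 with some legacy Windows files mixed in, is to try strict decodes in order of likelihood and keep the first that succeeds. The candidate list and the name guess_decode are illustrative choices, not part of the original answer.

```python
def guess_decode(raw: bytes, candidates=("utf-8-sig", "cp1252")):
    """Return (text, encoding_name) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this cannot raise;
    # the text may still be wrong, it just will not fail.
    return raw.decode("latin-1"), "latin-1"


text, used = guess_decode("name,city\nRené,Zürich\n".encode("cp1252"))
print(used)
```

Strict UTF-8 goes first because random legacy bytes rarely form valid UTF-8, whereas single-byte codecs accept almost anything and so make poor first guesses.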

Store Gtk.Textbuffer in SQL database. Encoding troubles

纵饮孤独 submitted on 2021-02-08 11:51:34
Question: I'm working on a note-taking app using python2/Gtk3/Glade. The notes are stored in a MySQL database and displayed in a TextView widget. I can load, store, and display plain text fine. However, I want the ability to add images to the note page and store them in the database, so the data has to be serialised, and I'm having some trouble figuring out how to encode/decode the serialised data going in and out of the database. I'm getting unicode start byte errors. If I was working with files I could just

Cleaning SQL “Incorrect string value” Error from PHP

一曲冷凌霜 submitted on 2021-02-08 10:20:54
Question: I've seen this question a million times, but everyone seems to want to solve the problem in the database. I do not. I'm getting this error when parsing a large text file, picking out what I need, and inserting it into my database. Out of 24 thousand rows or so, 30 or so have invalid characters in them. Here is an example of the error, followed by the query that caused it: [Query Error: Incorrect string value: '\xEF\xBC\x89' for column 'company' at row 1] [INSERT INTO mac_address_db_new (hex
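
The bytes '\xEF\xBC\x89' in the error are the UTF-8 encoding of U+FF09, a fullwidth right parenthesis. The question concerns PHP; the sketch below is written in Python purely to illustrate cleaning values before the INSERT, and it assumes the column uses a single-byte charset such as latin1 (the excerpt does not say). It folds compatibility characters to their plain forms and then drops whatever still does not fit. clean_for_charset and the sample value are hypothetical.

```python
import unicodedata


def clean_for_charset(value: str, charset: str = "latin-1") -> str:
    """Fold compatibility forms (NFKC), then drop characters the charset lacks."""
    folded = unicodedata.normalize("NFKC", value)
    return folded.encode(charset, errors="ignore").decode(charset)


# Made-up company name ending in the offending fullwidth parenthesis.
raw = b"Acme Co. (Shenzhen\xef\xbc\x89"
company = clean_for_charset(raw.decode("utf-8"))
print(company)
```

Dropping characters silently loses data, so logging the affected rows may be worth the extra line.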

Unable to change encoding of text files in Windows

旧巷老猫 submitted on 2021-02-08 08:19:10
Question: I have some text files with different encodings. Some of them are UTF-8 and some others are windows-1251 encoded. I tried to execute the following recursive script to encode them all to UTF-8.
    Get-ChildItem *.nfo -Recurse | ForEach-Object {
        $content = $_ | Get-Content
        Set-Content -PassThru $_.Fullname $content -Encoding UTF8 -Force
    }
After that I am unable to use the files in my Java program, because the files that were already UTF-8 encoded now also have the wrong encoding, and I couldn't get back the original text. In case of windows-1251
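
A likely cause is that every file is read with one default encoding, so files that were already UTF-8 get decoded incorrectly before being rewritten. A sketch of a per-file decision, in Python rather than PowerShell and assuming the only two encodings present are UTF-8 and windows-1251 as the question states; to_utf8 and convert_tree are hypothetical names.

```python
from pathlib import Path


def to_utf8(path: Path, fallback: str = "cp1251") -> str:
    """Rewrite one file as UTF-8 and report the encoding it was read with."""
    raw = path.read_bytes()
    try:
        # Strict UTF-8 first; "utf-8-sig" also strips a leading BOM if present.
        text, used = raw.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        text, used = raw.decode(fallback), fallback
    # Working with bytes avoids any newline translation.
    path.write_bytes(text.encode("utf-8"))
    return used


def convert_tree(root: Path, pattern: str = "*.nfo") -> dict:
    """Convert every matching file under root; map each path to its source encoding."""
    return {p: to_utf8(p) for p in root.rglob(pattern)}
```

Running it on a copy of the directory first is prudent, since the conversion overwrites files in place.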

UnicodeEncodeError: 'gbk' codec can't encode character '\ue13b' in position 25: illegal multibyte sequence

折月煮酒 submitted on 2021-02-08 05:35:55
Question: Error: UnicodeEncodeError: 'gbk' codec can't encode character '\ue13b' in position 25: illegal multibyte sequence. The file encoding format is UTF-8, and there is an unrecognized character in the file when it is read: ‘左足趾麻木’. Code:
    for line in open(label_filepath, encoding='utf-8'):
        print(line)
Answer 1: The error is happening when Python tries to print. When printing, that is, writing to sys.stdout, Python encodes the text to be printed with the encoding expected by the terminal. In this case the
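
Since the failure happens on output rather than on reading, one workaround is to make the text safe for the terminal's codec before printing. A small sketch, where encodable is a hypothetical helper and the sample line is invented to contain the same private-use character:

```python
import sys


def encodable(text: str, encoding: str) -> str:
    """Return text with characters the target codec lacks shown as escapes."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


line = "左足趾麻木\ue13b"
# Fall back to UTF-8 if stdout reports no encoding (e.g. when replaced by a stub).
print(encodable(line, sys.stdout.encoding or "utf-8"))
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is an alternative, provided the terminal can actually display UTF-8.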

Determining text file encoding schema

僤鯓⒐⒋嵵緔 submitted on 2021-02-08 05:03:35
Question: I am trying to create a method that can detect the encoding schema of a text file. I know there are many out there, but I know for sure my text file will be either ASCII, UTF-8, or UTF-16. I only need to detect these three. Anyone know a way to do this?
Answer 1: Use the StreamReader to identify the encoding. Example:
    using (var r = new StreamReader(filename, Encoding.Default))
    {
        richtextBox1.Text = r.ReadToEnd();
        var encoding = r.CurrentEncoding;
    }
Answer 2: First, open the file in binary mode and
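
A rough Python sketch of a binary-mode check, assuming (as the question does) that only ASCII, UTF-8, and UTF-16 are possible. The NUL-byte test for BOM-less UTF-16 is a heuristic added here, and detect_encoding is a hypothetical name.

```python
import codecs


def detect_encoding(raw: bytes) -> str:
    """Classify bytes as 'ascii', 'utf-8', or 'utf-16' (the only options assumed)."""
    # A byte order mark settles the question when present.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Heuristic: plain text in ASCII or UTF-8 should not contain NUL bytes,
    # while UTF-16 text with Latin characters is full of them.
    if b"\x00" in raw:
        return "utf-16"
    for name in ("ascii", "utf-8"):
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            pass
    return "utf-16"


with_bom = detect_encoding("hi".encode("utf-16"))
print(with_bom)
```

To use it on a file, read the contents with open(path, "rb").read() and pass the bytes in.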