mojibake

Converting special characters such as ü and à back to their original Latin-alphabet counterparts in C#

假装没事ソ Submitted on 2019-12-03 01:52:47
I have been given an export from a MySQL database that seems to have had its encoding muddled somewhat over time. It contains a mix of HTML character codes such as &uuml; and, more problematically, mojibake sequences such as Ã¼ and Ã (representing the same letters ü and à). It is my task to bring some consistency back to the file and get everything into the correct Latin characters, e.g. ü and à. An example of the sort of string I am dealing with is: 50 Tattoo DesinfektionslÃ¶sungstÃ¼cher fÃ¼r FlÃ¤chen, which should equate to: 50 Tattoo Desinfektionslösungstücher für Flächen
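A general repair has two steps: unescape the HTML entities, then reverse the UTF-8-bytes-read-as-Latin-1 round trip. The question is about C#, but the idea is language-neutral; here is a minimal sketch in Python (the helper name `repair` is my own):

```python
import html

def repair(s: str) -> str:
    """Undo HTML entities, then reverse UTF-8 bytes mis-read as Latin-1."""
    s = html.unescape(s)  # "f&uuml;r" -> "für"
    try:
        # "fÃ¼r" -> bytes C3 BC -> "für"; text that is already clean fails
        # this round trip and is returned unchanged by the except branch
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

print(repair("Desinfektionsl&ouml;sungst&uuml;cher"))  # Desinfektionslösungstücher
print(repair("fÃ¼r FlÃ¤chen"))                         # für Flächen
```

The try/except makes the function idempotent: already-correct text like "für" cannot survive the latin-1/utf-8 round trip, so it falls through untouched.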

Convert unicode with utf-8 string as content to str

柔情痞子 Submitted on 2019-12-02 21:04:25
I'm using pyquery to parse a page:

dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()

but what I get in content is a unicode string whose code points are really UTF-8-encoded bytes: u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...' How can I convert it to str without losing the content? To make it clear: I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8', not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'. If you have a unicode value
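A unicode value whose code points are all below 256 maps one-to-one onto a byte string through the latin-1 codec, which is exactly the conversion the asker wants. A sketch (written in Python 3 spelling, where str/bytes play the roles of Python 2's unicode/str):

```python
# the unicode value pyquery returned: its code points are really UTF-8 bytes
content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

# latin-1 maps code points 0-255 to the identical byte values
raw = content.encode('latin-1')
print(raw)                  # b'\xe5\xb1\x82...' -- the byte string the asker wanted
print(raw.decode('utf-8'))  # 层叠样式表 ("Cascading Style Sheets")
```

Decoding those recovered bytes as UTF-8 confirms nothing was lost: they spell out the Chinese for "Cascading Style Sheets", matching the Wikipedia CSS article being scraped.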

PHP Strange character before £ sign?

筅森魡賤 Submitted on 2019-12-02 05:16:36
For some reason I get a weird character, Â£76756687, when I type a £ into a text field on my form. John Parker: As you suspect, it's a character encoding issue. Is the page set to use a charset of UTF-8? (You can't go wrong with this encoding, really.) Also, you'll probably want to entity-encode the pound symbol on the way out (&pound;). As an example character set declaration (for both the form page and the HTML email) you could use: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> That said, is there a good reason for the user to have to enter the currency symbol? Would it be a better idea to
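The stray character appears because the form submits £ as UTF-8 (bytes C2 A3) while the page that redisplays it decodes those bytes as Windows-1252/Latin-1. The mechanism, sketched in Python for illustration:

```python
pound = "£"                          # U+00A3
as_utf8 = pound.encode("utf-8")      # b'\xc2\xa3' -- two bytes on the wire
misread = as_utf8.decode("cp1252")   # what a Windows-1252 page displays
print(misread)                       # Â£  <- the "weird character" before the £
```

Declaring charset=utf-8 consistently on the page that receives and redisplays the value removes the mismatch.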

Fixing mojibakes in UTF-8 text

怎甘沉沦 Submitted on 2019-12-02 00:30:36
I have a file with text in Portuguese in UTF-8. Somehow, whoever produced the file selected the wrong encoding, and the text is full of mojibake: IDENTIFICAÌàÌÄO instead of identificação, André instead of André. Automated tools do not see anything wrong with the file. I tried to fix it with the Python package ftfy, to no avail. How can I fix this file, apart from replacing all incorrect characters manually? "André" instead of "André" is the Latin-1 interpretation of the UTF-8 encoding. You can fix it by inverting the encoding/decoding:

>>> 'André'.encode('latin-1').decode('utf-8')
'André'

All cases

PHP Ansi to UTF-8

柔情痞子 Submitted on 2019-11-30 16:34:45
I'm trying to create a script in PHP for converting some files to UTF-8. I have a file in Greek, where Notepad++ indicates that it has "ANSI" encoding. When I upload it to the server, it detects its encoding as UTF-8 (wrongly, I think). Then, when I convert its contents to UTF-8 with utf8_encode() and download the new file, the characters are messed up. I tried to remove the BOM with PHP and the result was the same. I tried to remove the BOM with PHP without converting the file to UTF-8, but the file remained in ANSI encoding, without messed-up characters. How can I fix that? OZ_
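The likely culprit: "ANSI" for Greek Windows text usually means Windows-1253, but PHP's utf8_encode() assumes ISO-8859-1, so the accented Greek letters get mapped wrongly. The conversion the script actually needs, sketched here in Python (in PHP it would be iconv or mb_convert_encoding; "cp1253" is my assumption about what Notepad++ means by "ANSI" on a Greek system):

```python
def greek_ansi_to_utf8(raw: bytes) -> bytes:
    """Reinterpret Windows-1253 ("ANSI" Greek) bytes as UTF-8 bytes."""
    return raw.decode("cp1253").encode("utf-8")

sample = b"\xca\xe1\xeb\xe7\xec\xdd\xf1\xe1"   # "Καλημέρα" in cp1253
print(greek_ansi_to_utf8(sample).decode("utf-8"))
```

If the file turns out to be ISO-8859-7 instead, only the codec name changes; the decode-then-encode shape stays the same.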

nodejs: synchronously read a large file line by line?

懵懂的女人 Submitted on 2019-11-29 18:45:11
Question: I have a large file (UTF-8). I know fs.createReadStream can create a stream to read a large file, but that is not synchronous. So I tried to use fs.readSync, but the text it reads comes back broken, like "迈�".

var fs = require('fs');
var util = require('util');
var textPath = __dirname + '/people-daily.txt';
var fd = fs.openSync(textPath, "r");
var text = fs.readSync(fd, 4, 0, "utf8");
console.log(util.inspect(text, true, null));

Answer 1: For large files, readFileSync can be inconvenient, as it loads the whole file in
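The broken "迈�" is not an encoding mismatch but a split character: the legacy fs.readSync(fd, length, position, encoding) form reads a fixed number of bytes, and a UTF-8 character is 1 to 4 bytes long, so a 4-byte read can stop in the middle of one. The same effect, demonstrated in Python since it is purely about the bytes:

```python
data = "迈向".encode("utf-8")   # 6 bytes: three per character
chunk = data[:4]                # a fixed 4-byte read splits the second character
# the lone lead byte becomes U+FFFD, reproducing the question's output
print(chunk.decode("utf-8", errors="replace"))   # 迈�
```

The fix on the Node side is to read byte buffers and decode across chunk boundaries (e.g. with string_decoder), or to read the whole file and split it into lines.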

Unbaking mojibake

流过昼夜 Submitted on 2019-11-29 08:39:54
When you have incorrectly decoded characters, how can you identify likely candidates for the original string? Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png I know for a fact that this image filename should have been some Japanese characters. But with various guesses at urllib quoting/unquoting and encoding/decoding as iso8859-1 and utf-8, I haven't been able to unmunge it and recover the original filename. Is the corruption reversible? galinden: You could use chardet (install with pip):

import chardet
your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
detected_encoding = chardet.detect(your_str)["encoding"]
try:
    correct_str = your_str.decode(detected_encoding)

(Note that this snippet is Python 2, where str holds bytes; in Python 3, chardet.detect expects a bytes value.)
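Beyond chardet, this particular garbling has a recognizable signature: Shift-JIS bytes rendered under an IBM/DOS code page (the ▒ block character is the giveaway). One plausible reversal of the parenthesized part of the name, assuming code page 437, sketched in Python:

```python
garbled = "üiâAâjâüâpâXüj"      # the (...) portion of the filename
raw = garbled.encode("cp437")    # back to the original bytes: 81 69 83 41 ...
print(raw.decode("shift_jis"))   # （アニメパス） -- fullwidth parens + katakana
```

So the corruption is reversible. A few other characters in the name (×, È) are absent from cp437 but present in the sibling code page 850, so the full filename may need that table instead; the round-trip shape is identical.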

In what world would \u00c3\u00a9 become é?

浪子不回头ぞ Submitted on 2019-11-29 07:09:36
I have a likely improperly encoded JSON document from a source I do not control, which contains the following strings:

d\u00c3\u00a9cor
business\u00e2\u20ac\u2122 active accounts
the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label

From this, I am gathering they intend for \u00c3\u00a9 to become é, which would be UTF-8 hex C3 A9. That makes some sense. For the others, I assume we are dealing with some types of directional quotation marks. My theory here is that this is either using some encoding I've never encountered before, or that it has been double-encoded in some way. I am fine
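The double-encoding theory checks out, with Windows-1252 as the intermediate: ’ (U+2019) is UTF-8 E2 80 99, and those three bytes read as cp1252 are â (E2), € (80, U+20AC), and ™ (99, U+2122): exactly \u00e2\u20ac\u2122. Reversing it in Python:

```python
import json

doc = r'{"s": "d\u00c3\u00a9cor business\u00e2\u20ac\u2122 active accounts"}'
s = json.loads(doc)["s"]

# re-encode the mojibake as cp1252 to recover the raw bytes, then decode as UTF-8
fixed = s.encode("cp1252").decode("utf-8")
print(fixed)                       # décor business’ active accounts
```

The \u00e2\u20ac\u009d case (the closing ”) needs a lenient decoder such as ftfy's "sloppy-windows-1252", because 0x9D is unmapped in strict cp1252 and Python's codec will refuse it.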

Python correct encoding of Website (Beautiful Soup)

时光总嘲笑我的痴心妄想 Submitted on 2019-11-29 00:15:06
I am trying to load an HTML page and output the text. Even though I am getting the webpage correctly, BeautifulSoup somehow destroys the encoding. Source:

# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup

url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)
encodedText = r.text.encode("utf-8")
soup = BeautifulSoup(encodedText)
text = str(soup.findAll(text=True))
print text.decode("utf-8")

Excerpt of the output: ...Odenw\xc3\xa4lderisch... This should be Odenwälderisch. Answer: You are making two mistakes: you are mis-handling the encoding, and you are treating a result
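The second mistake is visible in the output itself: calling str() on the list returned by findAll() produces the list's repr, which renders every non-ASCII byte of the (needlessly) re-encoded text as a literal \xc3\xa4 escape. A stdlib-only illustration of the effect, and of the fix (join the strings instead of repr-ing the list):

```python
word = "Odenwälderisch"

# mistake: encode first, then stringify the result list -> escape soup
broken = str([word.encode("utf-8")])
print(broken)        # [b'Odenw\xc3\xa4lderisch'] -- escapes leak into the text

# fix: keep the text as (unicode) strings and join the pieces
fixed = "".join([word])
print(fixed)         # Odenwälderisch
```

Applied to the question's code: pass r.text (already decoded) straight to BeautifulSoup without the .encode("utf-8") step, and join the text nodes rather than stringifying the list.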