mojibake

Converting special characters such as ü and à back to their original Latin-alphabet counterparts in C#

假装没事ソ Submitted on 2019-12-03 01:52:47
I have been given an export from a MySQL database that seems to have had its encoding muddled somewhat over time. It contains a mix of HTML character codes such as &uuml; and, more problematically, mojibake sequences such as Ã¼ and Ã (representing the same letters ü and à). It is my task to bring some consistency back to the file and get everything into the correct Latin characters, e.g. ü and à. An example of the sort of string I am dealing with is: 50 Tattoo DesinfektionslÃ¶sungstÃ¼cher fÃ¼r FlÃ¤chen, which should equate to: 50 Tattoo Desinfektionslösungstücher für Flächen
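A general repair has two steps: unescape the HTML entities, then reverse the UTF-8-bytes-read-as-Latin-1 round trip. The question is about C#, but the idea is language-neutral; here is a minimal sketch in Python (the helper name `repair` is my own):

```python
import html

def repair(s: str) -> str:
    """Undo HTML entities, then reverse UTF-8 bytes mis-read as Latin-1."""
    s = html.unescape(s)  # "f&uuml;r" -> "für"
    try:
        # "fÃ¼r" -> bytes C3 BC -> "für"; text that is already clean fails
        # this round trip and is returned unchanged by the except branch
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

print(repair("Desinfektionsl&ouml;sungst&uuml;cher"))  # Desinfektionslösungstücher
print(repair("fÃ¼r FlÃ¤chen"))                         # für Flächen
```

The try/except makes the function idempotent: already-correct text like "für" cannot survive the latin-1/utf-8 round trip, so it falls through untouched.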

Convert unicode with utf-8 string as content to str

柔情痞子 Submitted on 2019-12-02 21:04:25
I'm using pyquery to parse a page:

dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()

but what I get in content is a unicode string whose code points are really UTF-8-encoded bytes: u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...' How can I convert it to str without losing the content? To make it clear: I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8', not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'. If you have a unicode value
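A unicode value whose code points are all below 256 maps one-to-one onto a byte string through the latin-1 codec, which is exactly the conversion the asker wants. A sketch (written in Python 3 spelling, where str/bytes play the roles of Python 2's unicode/str):

```python
# the unicode value pyquery returned: its code points are really UTF-8 bytes
content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

# latin-1 maps code points 0-255 to the identical byte values
raw = content.encode('latin-1')
print(raw)                  # b'\xe5\xb1\x82...' -- the byte string the asker wanted
print(raw.decode('utf-8'))  # 层叠样式表 ("Cascading Style Sheets")
```

Decoding those recovered bytes as UTF-8 confirms nothing was lost: they spell out the Chinese for "Cascading Style Sheets", matching the Wikipedia CSS article being scraped.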

PHP Strange character before £ sign?

筅森魡賤 Submitted on 2019-12-02 05:16:36
For some reason I get a weird character, Â£76756687, when I type a £ into a text field on my form. John Parker: As you suspect, it's a character encoding issue. Is the page set to use a charset of UTF-8? (You can't go wrong with this encoding, really.) Also, you'll probably want to entity-encode the pound symbol on the way out (&pound;). As an example character set declaration (for both the form page and the HTML email) you could use: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> That said, is there a good reason for the user to have to enter the currency symbol? Would it be a better idea to
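The stray character appears because the form submits £ as UTF-8 (bytes C2 A3) while the page that redisplays it decodes those bytes as Windows-1252/Latin-1. The mechanism, sketched in Python for illustration:

```python
pound = "£"                          # U+00A3
as_utf8 = pound.encode("utf-8")      # b'\xc2\xa3' -- two bytes on the wire
misread = as_utf8.decode("cp1252")   # what a Windows-1252 page displays
print(misread)                       # Â£  <- the "weird character" before the £
```

Declaring charset=utf-8 consistently on the page that receives and redisplays the value removes the mismatch.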

Fixing mojibakes in UTF-8 text

怎甘沉沦 Submitted on 2019-12-02 00:30:36
I have a file with text in Portuguese in UTF-8. Somehow, whoever produced the file selected the wrong encoding, and the text is full of mojibake: IDENTIFICAÌàÌÄO instead of identificação, André instead of André. Automated tools do not see anything wrong with the file. I tried to fix it with the Python package ftfy, to no avail. How can I fix this file, apart from replacing all incorrect characters manually? "André" instead of "André" is the Latin-1 interpretation of the UTF-8 encoding. You can fix it by inverting the encoding/decoding:

>>> 'André'.encode('latin-1').decode('utf-8')
'André'

All cases

PHP Ansi to UTF-8

柔情痞子 Submitted on 2019-11-30 16:34:45
I'm trying to create a script in PHP for converting some files to UTF-8. I have a file in Greek, where Notepad++ indicates that it has "ANSI" encoding. When I upload it to the server, it detects its encoding as UTF-8 (wrongly, I think). Then, when I convert its contents to UTF-8 with utf8_encode() and download the new file, the characters are messed up. I tried to remove the BOM with PHP and the result was the same. I tried to remove the BOM with PHP without converting the file to UTF-8, but the file remained in ANSI encoding, without messed-up characters. How can I fix that? OZ_
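The likely culprit: "ANSI" for Greek Windows text usually means Windows-1253, but PHP's utf8_encode() assumes ISO-8859-1, so the accented Greek letters get mapped wrongly. The conversion the script actually needs, sketched here in Python (in PHP it would be iconv or mb_convert_encoding; "cp1253" is my assumption about what Notepad++ means by "ANSI" on a Greek system):

```python
def greek_ansi_to_utf8(raw: bytes) -> bytes:
    """Reinterpret Windows-1253 ("ANSI" Greek) bytes as UTF-8 bytes."""
    return raw.decode("cp1253").encode("utf-8")

sample = b"\xca\xe1\xeb\xe7\xec\xdd\xf1\xe1"   # "Καλημέρα" in cp1253
print(greek_ansi_to_utf8(sample).decode("utf-8"))
```

If the file turns out to be ISO-8859-7 instead, only the codec name changes; the decode-then-encode shape stays the same.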

nodejs: synchronously read a large file line by line?

懵懂的女人 Submitted on 2019-11-29 18:45:11
Question: I have a large file (UTF-8). I know fs.createReadStream can create a stream to read a large file, but that is not synchronous. So I tried to use fs.readSync, but the text it reads comes back broken, like "迈�".

var fs = require('fs');
var util = require('util');
var textPath = __dirname + '/people-daily.txt';
var fd = fs.openSync(textPath, "r");
var text = fs.readSync(fd, 4, 0, "utf8");
console.log(util.inspect(text, true, null));

Answer 1: For large files, readFileSync can be inconvenient, as it loads the whole file in
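The broken "迈�" is not an encoding mismatch but a split character: the legacy fs.readSync(fd, length, position, encoding) form reads a fixed number of bytes, and a UTF-8 character is 1 to 4 bytes long, so a 4-byte read can stop in the middle of one. The same effect, demonstrated in Python since it is purely about the bytes:

```python
data = "迈向".encode("utf-8")   # 6 bytes: three per character
chunk = data[:4]                # a fixed 4-byte read splits the second character
# the lone lead byte becomes U+FFFD, reproducing the question's output
print(chunk.decode("utf-8", errors="replace"))   # 迈�
```

The fix on the Node side is to read byte buffers and decode across chunk boundaries (e.g. with string_decoder), or to read the whole file and split it into lines.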

Unbaking mojibake

流过昼夜 Submitted on 2019-11-29 08:39:54
When you have incorrectly decoded characters, how can you identify likely candidates for the original string? Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png I know for a fact that this image filename should have been some Japanese characters. But with various guesses at urllib quoting/unquoting and encoding/decoding as iso8859-1 and utf-8, I haven't been able to unmunge it and recover the original filename. Is the corruption reversible? galinden: You could use chardet (install with pip):

import chardet
your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
detected_encoding = chardet.detect(your_str)["encoding"]
try:
    correct_str = your_str.decode(detected_encoding)

(Note that this snippet is Python 2, where str holds bytes; in Python 3, chardet.detect expects a bytes value.)
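Beyond chardet, this particular garbling has a recognizable signature: Shift-JIS bytes rendered under an IBM/DOS code page (the ▒ block character is the giveaway). One plausible reversal of the parenthesized part of the name, assuming code page 437, sketched in Python:

```python
garbled = "üiâAâjâüâpâXüj"      # the (...) portion of the filename
raw = garbled.encode("cp437")    # back to the original bytes: 81 69 83 41 ...
print(raw.decode("shift_jis"))   # （アニメパス） -- fullwidth parens + katakana
```

So the corruption is reversible. A few other characters in the name (×, È) are absent from cp437 but present in the sibling code page 850, so the full filename may need that table instead; the round-trip shape is identical.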

In what world would \u00c3\u00a9 become é?

浪子不回头ぞ Submitted on 2019-11-29 07:09:36
I have a likely improperly encoded JSON document from a source I do not control, which contains the following strings:

d\u00c3\u00a9cor
business\u00e2\u20ac\u2122 active accounts
the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label

From this, I am gathering they intend for \u00c3\u00a9 to become é, which would be UTF-8 hex C3 A9. That makes some sense. For the others, I assume we are dealing with some types of directional quotation marks. My theory here is that this is either using some encoding I've never encountered before, or that it has been double-encoded in some way. I am fine
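The double-encoding theory checks out, with Windows-1252 as the intermediate: ’ (U+2019) is UTF-8 E2 80 99, and those three bytes read as cp1252 are â (E2), € (80, U+20AC), and ™ (99, U+2122): exactly \u00e2\u20ac\u2122. Reversing it in Python:

```python
import json

doc = r'{"s": "d\u00c3\u00a9cor business\u00e2\u20ac\u2122 active accounts"}'
s = json.loads(doc)["s"]

# re-encode the mojibake as cp1252 to recover the raw bytes, then decode as UTF-8
fixed = s.encode("cp1252").decode("utf-8")
print(fixed)                       # décor business’ active accounts
```

The \u00e2\u20ac\u009d case (the closing ”) needs a lenient decoder such as ftfy's "sloppy-windows-1252", because 0x9D is unmapped in strict cp1252 and Python's codec will refuse it.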

Python correct encoding of Website (Beautiful Soup)

时光总嘲笑我的痴心妄想 Submitted on 2019-11-29 00:15:06
I am trying to load an HTML page and output the text. Even though I am getting the webpage correctly, BeautifulSoup somehow destroys the encoding. Source:

# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup

url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)
encodedText = r.text.encode("utf-8")
soup = BeautifulSoup(encodedText)
text = str(soup.findAll(text=True))
print text.decode("utf-8")

Excerpt of the output: ...Odenw\xc3\xa4lderisch... This should be Odenwälderisch. Answer: You are making two mistakes: you are mis-handling the encoding, and you are treating a result
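The second mistake is visible in the output itself: calling str() on the list returned by findAll() produces the list's repr, which renders every non-ASCII byte of the (needlessly) re-encoded text as a literal \xc3\xa4 escape. A stdlib-only illustration of the effect, and of the fix (join the strings instead of repr-ing the list):

```python
word = "Odenwälderisch"

# mistake: encode first, then stringify the result list -> escape soup
broken = str([word.encode("utf-8")])
print(broken)        # [b'Odenw\xc3\xa4lderisch'] -- escapes leak into the text

# fix: keep the text as (unicode) strings and join the pieces
fixed = "".join([word])
print(fixed)         # Odenwälderisch
```

Applied to the question's code: pass r.text (already decoded) straight to BeautifulSoup without the .encode("utf-8") step, and join the text nodes rather than stringifying the list.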