mojibake | 易学教程

Character Encoding and the â€™ Issue

阅读更多关于 Character Encoding and the â€™ Issue

Even today, one frequently sees character encoding problems with significant frequency. Take for example this recent job post: ( Note : This is an example, not a spam job post... :-) I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN. My two-part question: What causes this particular, common encoding issue? As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser. What causes

Fixing encodings

阅读更多关于 Fixing encodings

I have ended up with messed up character encodings in one of our mysql columns. Typically I have √© instead of é √∂ instead of ö √≠ instead of í and so on... Fairly certain that someone here would know what happened and how to fix. UPDATE: Based on bobince's answer and since I had this data in a file I did the following #!/user/bin/env python import codecs f = codecs.open('./file.csv', 'r', 'utf-8') f2 = codecs.open('./file-fixed.csv', 'w', 'utf-8') for line in f: f2.write(line.encode('macroman').decode('utf-8')), after which load data infile 'file-fixed.csv' into table list1 fields terminated

Unbaking mojibake

阅读更多关于 Unbaking mojibake

问题 When you have incorrectly decoded characters, how can you identify likely candidates for the original string? Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png I know for a fact that this image filename should have been some Japanese characters. But with various guesses at urllib quoting/unquoting, encode and decode iso8859-1, utf8, I haven't been able to unmunge and get the original filename. Is the corruption reversible? 回答1: You could use chardet (install with pip): import chardet your_str = "Ä×èÈÄÄî▒è

In what world would \\u00c3\\u00a9 become é?

阅读更多关于 In what world would \\u00c3\\u00a9 become é?

问题 I have a likely improperly encoded json document from a source I do not control, which contains the following strings: d\u00c3\u00a9cor business\u00e2\u20ac\u2122 active accounts the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label From this, I am gathering they intend for \u00c3\u00a9 to beceom é , which would be utf-8 hex C3 A9 . That makes some sense. For the others, I assume we are dealing with some types of directional quotation marks. My theory here is that this is either using

How to convert these strange characters? (Ã«, Ã, Ã¬, Ã¹, Ã)

阅读更多关于 How to convert these strange characters? (Ã«, Ã, Ã¬, Ã¹, Ã)

My page often shows things like Ã«, Ã, Ã¬, Ã¹, Ã in place of normal characters. I use utf8 for header page and MySQL encode. How does this happen? These are utf-8 encoded characters. Use utf8_decode() to convert them to normal ISO-8859-1 characters. If you see those characters you probably just didn’t specify the character encoding properly . Because those characters are the result when an UTF-8 multi-byte string is interpreted with a single-byte encoding like ISO 8859-1 or Windows-1252 . In this case Ã« could be encoded with 0xC3 0xAB that represents the Unicode character ë (U+00EB) in UTF-8.

Character Encoding and the â€™ Issue

阅读更多关于 Character Encoding and the â€™ Issue

问题 Even today, one frequently sees character encoding problems with significant frequency. Take for example this recent job post: ( Note : This is an example, not a spam job post... :-) I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN. My two-part question: What causes this particular, common encoding issue? As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires

Getting â€™ instead of an apostrophe(') in PHP

阅读更多关于 Getting â€™ instead of an apostrophe(') in PHP

I've tried converting the text to or from utf8, which didn't seem to help. I'm getting: "Itâ€™s Getting the Best of Me" It should be: "It’s Getting the Best of Me" I'm getting this data from this url. Matthew To convert to HTML entities: <?php echo mb_convert_encoding( file_get_contents('http://www.tvrage.com/quickinfo.php?show=Surviver&ep=20x02&exact=0'), "HTML-ENTITIES", "UTF-8" ); ?> See docs for mb_convert_encoding for more encoding options. Make sure your html header specifies utf8 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> That usually does the trick for me

Facebook JSON badly encoded

阅读更多关于 Facebook JSON badly encoded

I downloaded my Facebook messenger data (in your Facebook account, go to settings, then to Your Facebook information , then Download your information , then create a file with at least the Messages box checked) to do some cool statistics However there is a small problem with encoding. I'm not sure, but it looks like Facebook used bad encoding for this data. When I open it with text editor I see something like this: Rados\u00c5\u0082aw . When I try to open it with python (UTF-8) I get RadosÅ\x82aw . However I should get: Radosław . My python script: text = open(os.path.join(subdir, file),

How to convert these strange characters? (Ã«, Ã, Ã¬, Ã¹, Ã)

阅读更多关于 How to convert these strange characters? (Ã«, Ã, Ã¬, Ã¹, Ã)

问题 My page often shows things like Ã«, Ã, Ã¬, Ã¹, Ã in place of normal characters. I use utf8 for header page and MySQL encode. How does this happen? 回答1: These are utf-8 encoded characters. Use utf8_decode() to convert them to normal ISO-8859-1 characters. 回答2: If you see those characters you probably just didn’t specify the character encoding properly. Because those characters are the result when an UTF-8 multi-byte string is interpreted with a single-byte encoding like ISO 8859-1 or Windows

“â€™” showing on page instead of “ ' ”

阅读更多关于 “â€™” showing on page instead of “ ' ”

问题 â€™ is showing on my page instead of \' . I have the Content-Type set to UTF-8 in both my <head> tag and my HTTP headers: <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" /> In addition, my browser is set to Unicode (UTF-8) : So what\'s the problem, and how can I fix it? 回答1: Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252. Or use ’ . 回答2: So what's the problem, It's a ’ (RIGHT SINGLE QUOTATION MARK - U+2019) character which has