mojibake

Character Encoding and the ’ Issue

回眸只為那壹抹淺笑 submitted on 2019-11-28 14:00:45
Even today, one sees character encoding problems with significant frequency. Take for example this recent job post: (Note: this is an example, not a spam job post... :-) I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN. My two-part question:

1. What causes this particular, common encoding issue?
2. As a developer, what should I do with user input to avoid common encoding issues like this one?

If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.
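As an illustration of the mechanism (my own sketch, not part of the original post): the curly apostrophe is U+2019 RIGHT SINGLE QUOTATION MARK, whose three UTF-8 bytes, when read back with Windows-1252, display as the three characters ’.

    apostrophe = "\u2019"                        # ’ RIGHT SINGLE QUOTATION MARK
    utf8_bytes = apostrophe.encode("utf-8")      # b'\xe2\x80\x99'
    print(utf8_bytes.decode("cp1252"))           # ’  (the mojibake from the title)
    # Repair goes the other way: re-encode the damaged text and decode it as UTF-8.
    print("’".encode("cp1252").decode("utf-8"))  # ’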

Fixing encodings

孤者浪人 submitted on 2019-11-28 11:45:13
I have ended up with messed-up character encodings in one of our MySQL columns. Typically I have √© instead of é, √∂ instead of ö, √≠ instead of í, and so on... Fairly certain that someone here would know what happened and how to fix it.

UPDATE: Based on bobince's answer, and since I had this data in a file, I did the following:

    #!/usr/bin/env python
    import codecs

    f = codecs.open('./file.csv', 'r', 'utf-8')
    f2 = codecs.open('./file-fixed.csv', 'w', 'utf-8')
    for line in f:
        f2.write(line.encode('macroman').decode('utf-8'))

after which load data infile 'file-fixed.csv' into table list1 fields terminated
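For what it's worth, here is a small sketch (mine, assuming the corruption really was UTF-8 bytes read as MacRoman, which the √©/√∂ pattern suggests) of both the damage and the repair that the script above performs line by line:

    original = "é"
    damaged = original.encode("utf-8").decode("macroman")   # '√©' – matches the symptom above
    repaired = damaged.encode("macroman").decode("utf-8")   # back to 'é'
    print(damaged, repaired)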

Unbaking mojibake

旧巷老猫 submitted on 2019-11-28 02:06:38
Question: When you have incorrectly decoded characters, how can you identify likely candidates for the original string?

Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png

I know for a fact that this image filename should have been some Japanese characters. But with various guesses at urllib quoting/unquoting and at encoding and decoding iso8859-1 and utf8, I haven't been able to unmunge it and get the original filename. Is the corruption reversible?

Answer 1: You could use chardet (install with pip):

    import chardet
    your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"
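chardet only guesses at a byte-level encoding, though. Another approach (my own sketch; the candidate encodings below are guesses, not from the answer) is to brute-force pairs of a "wrong" single-byte encoding and a plausible "right" Japanese encoding and inspect whatever round-trips cleanly:

    mangled = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"
    wrong_encodings = ["cp437", "cp850", "latin-1", "cp1252"]   # guesses at how it was mis-decoded
    right_encodings = ["shift_jis", "cp932", "euc_jp", "utf-8"] # guesses at the true encoding
    for wrong in wrong_encodings:
        for right in right_encodings:
            try:
                candidate = mangled.encode(wrong).decode(right)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue  # this pair cannot explain the bytes; skip it
            print(f"{wrong} -> {right}: {candidate}")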

In what world would \u00c3\u00a9 become é?

╄→гoц情女王★ submitted on 2019-11-28 00:47:05
Question: I have a likely improperly encoded JSON document from a source I do not control, which contains the following strings:

d\u00c3\u00a9cor
business\u00e2\u20ac\u2122 active accounts
the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label

From this, I am gathering they intend for \u00c3\u00a9 to become é, which would be UTF-8 hex C3 A9. That makes some sense. For the others, I assume we are dealing with some types of directional quotation marks. My theory here is that this is either using
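A hedged guess at what happened, sketched below: the producer took UTF-8 bytes, read each byte as a Windows-1252 character, and then JSON-escaped those characters individually. If so, the round trip recovers the intended text. (The \u00e2\u20ac\u009d sequence contains U+009D, which strict cp1252 cannot encode, so a full repair may need an error handler or a tool like ftfy.)

    import json

    raw = r'{"a": "d\u00c3\u00a9cor", "b": "business\u00e2\u20ac\u2122 active accounts"}'
    data = json.loads(raw)
    for value in data.values():
        print(value.encode("cp1252").decode("utf-8"))
    # décor
    # business’ active accounts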

How to convert these strange characters? (ë, Ã, ì, ù, Ã)

拥有回忆 submitted on 2019-11-27 11:39:24
Question: My page often shows things like ë, Ã, ì, ù, à in place of normal characters. I use utf8 for the header page and for the MySQL encoding. How does this happen?

Answer 1: These are UTF-8 encoded characters. Use utf8_decode() to convert them to normal ISO-8859-1 characters.

Answer 2: If you see those characters, you probably just didn't specify the character encoding properly, because those characters are the result when a UTF-8 multi-byte string is interpreted with a single-byte encoding like ISO 8859-1 or Windows-1252. In this case, ë corresponds to the byte sequence 0xC3 0xAB, which is the UTF-8 encoding of the Unicode character ë (U+00EB).
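The longer sequences in the question title suggest the mis-decoding may even have happened twice. A small sketch (my own illustration) of single versus double damage and the corresponding repair:

    s = "ë"
    once = s.encode("utf-8").decode("cp1252")      # 'Ã«'
    twice = once.encode("utf-8").decode("cp1252")  # 'Ãƒ«'
    # Repair by reversing each round of damage:
    repaired = twice.encode("cp1252").decode("utf-8").encode("cp1252").decode("utf-8")
    print(once, twice, repaired)                   # Ã« Ãƒ« ë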

Getting ’ instead of an apostrophe(') in PHP

限于喜欢 submitted on 2019-11-26 18:41:32
I've tried converting the text to or from utf8, which didn't seem to help. I'm getting: "It’s Getting the Best of Me" It should be: "It’s Getting the Best of Me" I'm getting this data from this url.

Matthew: To convert to HTML entities:

    <?php
    echo mb_convert_encoding(
        file_get_contents('http://www.tvrage.com/quickinfo.php?show=Surviver&ep=20x02&exact=0'),
        "HTML-ENTITIES",
        "UTF-8"
    );
    ?>

See the docs for mb_convert_encoding for more encoding options. Make sure your HTML header specifies UTF-8:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

That usually does the trick for me.

Facebook JSON badly encoded

此生再无相见时 submitted on 2019-11-26 16:40:18
I downloaded my Facebook Messenger data (in your Facebook account, go to Settings, then to Your Facebook Information, then Download Your Information, then create a file with at least the Messages box checked) to do some cool statistics. However, there is a small problem with encoding. I'm not sure, but it looks like Facebook used bad encoding for this data. When I open it with a text editor I see something like this: Rados\u00c5\u0082aw. When I try to open it with Python (UTF-8) I get RadosÅ\x82aw. However, I should get: Radosław. My Python script:

    text = open(os.path.join(subdir, file),
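The fix commonly suggested for this export (a sketch under that assumption, not necessarily the asker's final script): Facebook escapes the individual UTF-8 bytes as \u00XX code points, so after parsing the JSON you can re-encode each string as Latin-1 and decode it as UTF-8.

    import json

    raw = r'{"name": "Rados\u00c5\u0082aw"}'
    data = json.loads(raw)
    fixed = data["name"].encode("latin-1").decode("utf-8")
    print(fixed)  # Radosław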

“’” showing on page instead of “'”

半世苍凉 submitted on 2019-11-26 01:43:10
Question: ’ is showing on my page instead of '. I have the Content-Type set to UTF-8 in both my <head> tag and my HTTP headers:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

In addition, my browser is set to Unicode (UTF-8). So what's the problem, and how can I fix it?

Answer 1: Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252. Or use ’.

Answer 2: So what's the problem? It's a ’ (RIGHT SINGLE QUOTATION MARK - U+2019) character which has
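If hand-rolled round-tripping feels fragile, one option (my suggestion, not from the answers above) is the third-party ftfy library, which exists specifically to repair this class of mojibake:

    # pip install ftfy
    import ftfy

    print(ftfy.fix_text("’ is showing on my page instead of '."))
    # prints: ’ is showing on my page instead of '.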