mojibake | 易学教程

Fixing mojibakes in UTF-8 text

Question: I have a file with Portuguese text in UTF-8. Whoever produced the file apparently selected the wrong encoding, and the text is full of mojibake:

    IDENTIFICAÌàÌÄO instead of identificação
    AndrÃ© instead of André

Automated tools do not see anything wrong with the file. I tried to fix it with the Python package ftfy, to no avail. How can I fix this file, apart from replacing all incorrect characters manually?

Answer 1: "AndrÃ©" instead of "André" is the Latin-1 interpretation of the UTF-8 encoding. You can fix it
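The Latin-1 misreading that the answer describes can be reversed by encoding the text back to Latin-1 and decoding the resulting bytes as UTF-8. The following is a minimal sketch of that idea; the function name `undo_latin1_misread` is hypothetical, and the sketch is only checked against the "AndrÃ©" sample. The first sample in the question may involve a different corruption that this round trip does not handle.

```python
def undo_latin1_misread(text: str) -> str:
    """Reverse UTF-8 bytes that were mistakenly decoded as Latin-1."""
    # Latin-1 maps every code point below 256 to the byte with the same value,
    # so encoding recovers the original bytes, which are then read as UTF-8.
    return text.encode("latin-1").decode("utf-8")


print(undo_latin1_misread("AndrÃ©"))  # prints: André
```

If the text was not produced by this exact misreading, the call raises a `UnicodeEncodeError` or `UnicodeDecodeError` instead of silently returning wrong output, which makes it reasonably safe to try on a copy of the file.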

Python correct encoding of Website (Beautiful Soup)

Question: I am trying to load an HTML page and output its text. Although I am getting the web page correctly, BeautifulSoup somehow destroys the encoding.

Source:

    # -*- coding: utf-8 -*-
    import requests
    from BeautifulSoup import BeautifulSoup

    url = "http://www.columbia.edu/~fdc/utf8/"
    r = requests.get(url)
    encodedText = r.text.encode("utf-8")
    soup = BeautifulSoup(encodedText)
    text = str(soup.findAll(text=True))
    print text.decode("utf-8")

Excerpt of the output: ...Odenw\xc3\xa4lderisch... this should be
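The excerpt is cut off before any answer, so the cause is not stated. One plausible explanation, offered here as an assumption, is that calling `str()` on a list of strings produces the escaped representation of each element rather than the text itself. The Python 3 sketch below reproduces that effect with the standard library only; the `chunks` list is made-up sample data standing in for the result of `findAll(text=True)`.

```python
# Hypothetical stand-in for the text nodes returned by the parser.
chunks = [b"Odenw\xc3\xa4lderisch", b"Deutsch"]

# str() on a list shows the repr of each element, so the UTF-8 bytes
# appear as literal backslash escapes instead of readable characters.
as_repr = str(chunks)

# Decoding each element and joining them yields the readable text.
readable = " ".join(chunk.decode("utf-8") for chunk in chunks)

print(as_repr)
print(readable)
```

Under that assumption, the encoding was never damaged; only the way the list was turned into a string hid the real characters.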

How to replace ï¿½ in a string

Question: I have a string that contains the character ï¿½ and I haven't been able to replace it correctly. String.replace("ï¿½", ""); doesn't work. Does anyone know how to remove or replace the ï¿½ in the string?

Answer 1: That's the Unicode replacement character, \uFFFD. Something like this should work:

    String strImport = "For some reason my �double quotes� were lost.";
    strImport = strImport.replaceAll("\uFFFD", "\"");

Answer 2: Character issues like this are difficult to diagnose because information is easily
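The same replacement can be sketched in Python. The three characters ï¿½ are what the UTF-8 bytes of U+FFFD look like when they are read as Latin-1, so the sketch below handles both the real replacement character and that mis-decoded form. The helper name `replace_lost_chars` is hypothetical.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replace_lost_chars(text: str, substitute: str = "") -> str:
    """Replace U+FFFD and its Latin-1 mis-decoding with a substitute."""
    # The UTF-8 bytes EF BF BD, read as Latin-1, give the three-character
    # sequence that shows up in the question.
    garbled = REPLACEMENT.encode("utf-8").decode("latin-1")
    return text.replace(REPLACEMENT, substitute).replace(garbled, substitute)


print(replace_lost_chars("my \ufffddouble quotes\ufffd were lost.", '"'))
```

The original character behind each U+FFFD is already gone by the time it appears, so any substitute is a guess; the double quote matches the answer's example only.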

Getting â€™ instead of an apostrophe(') in PHP

Question: I've tried converting the text to or from UTF-8, which didn't seem to help. I'm getting:

    "Itâ€™s Getting the Best of Me"

It should be:

    "It’s Getting the Best of Me"

I'm getting this data from this URL.

Answer 1: To convert to HTML entities:

    <?php
    echo mb_convert_encoding(
        file_get_contents('http://www.tvrage.com/quickinfo.php?show=Surviver&ep=20x02&exact=0'),
        "HTML-ENTITIES",
        "UTF-8"
    );
    ?>

See the docs for mb_convert_encoding for more encoding options.

Answer 2: Make sure your HTML header specifies UTF-8
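The sequence â€™ is what the UTF-8 bytes of the right single quotation mark (U+2019) look like when they are decoded as Windows-1252. As an illustration in Python rather than PHP, the following minimal sketch reverses that misreading; the function name `undo_cp1252_misread` is hypothetical.

```python
def undo_cp1252_misread(text: str) -> str:
    """Reverse UTF-8 bytes that were mistakenly decoded as Windows-1252."""
    # Encoding as cp1252 recovers the original bytes (E2 80 99 for the
    # apostrophe), which are then decoded correctly as UTF-8.
    return text.encode("cp1252").decode("utf-8")


print(undo_cp1252_misread("Itâ€™s Getting the Best of Me"))
```

This repairs text that is already damaged. Declaring UTF-8 consistently, as the second answer suggests, avoids the damage in the first place.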

Facebook JSON badly encoded

Question: I downloaded my Facebook Messenger data (in your Facebook account, go to Settings, then Your Facebook information, then Download your information, then create a file with at least the Messages box checked) to do some cool statistics. However, there is a small problem with the encoding. I'm not sure, but it looks like Facebook used a bad encoding for this data. When I open it with a text editor I see something like this: Rados\u00c5\u0082aw. When I try to open it with Python (UTF-8) I get RadosÅ
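The escapes \u00c5\u0082 are the two UTF-8 bytes of "ł" (C5 82) written as if each byte were its own code point. A minimal sketch of one way to repair such a value after parsing, assuming the whole string was damaged this way: encode it as Latin-1 to get the bytes back, then decode them as UTF-8. The variable names are illustrative only.

```python
import json

# A raw JSON string literal like the one shown in the question.
raw = r'"Rados\u00c5\u0082aw"'

# json.loads turns each \u00XX escape into one code point below 256.
name = json.loads(raw)

# Latin-1 maps those code points back to single bytes, which form valid UTF-8.
fixed = name.encode("latin-1").decode("utf-8")

print(fixed)  # prints: Radosław
```

For a full export, the same round trip would have to be applied to every string value in the parsed structure, not to the file as a whole.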

python replace unicode characters

Question: I wrote a program to read a Windows DNS debugging log, but the domain field always contains some funny characters. Here is one example:

    (13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

I want to replace every \x.. with a ?. Explicitly typing \xc2 as follows works:

    line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
    re.sub('\\\xc2', '?', line)

Result:

    '(13)?\xb5?\xb1?\xbe\xc3\xa2p\xc3\xb4?\x8d(5)example(3)com(0)'

But it does not work if I write it as follows:

    re.sub('\\\x..', '?', line)

How can I write a regular expression to
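The \x.. sequences in the example are single non-ASCII bytes shown in escaped form, not literal backslashes followed by letters, which is why a pattern built around the text "\x" cannot match them. A minimal Python 3 sketch, assuming the log line is available as a bytes object, uses a byte range in a character class instead:

```python
import re

# The sample line from the question, as raw bytes.
line = b"(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)"

# Match any single byte outside the ASCII range and replace it with "?".
cleaned = re.sub(rb"[\x80-\xff]", b"?", line)

print(cleaned)  # prints: b'(13)????????p????(5)example(3)com(0)'
```

Each two-byte sequence becomes two question marks here. Using the pattern `rb"[\x80-\xff]+"` instead would collapse every run of such bytes into a single question mark.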

Converting special characters such as Ã¼ and Ãƒ back to their original Latin alphabet counterparts in C#

Question: I have been given an export from a MySQL database that seems to have had its encoding muddled over time. It contains a mix of HTML character codes such as &uuml; and more problematic characters representing the same letters, such as Ã¼ and Ãƒ. My task is to bring some consistency back to the file and get everything into the correct Latin characters, e.g. ú and ó. An example of the sort of string I am dealing with is

    DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen

Which should
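The sample string looks like UTF-8 that was misread as Windows-1252 twice in a row, which is an interpretation rather than something the excerpt states. The sketch below is in Python rather than C#, and the helper name `repair` and the pass limit are hypothetical. It first resolves HTML entities, then undoes the misreading repeatedly until the round trip fails or stops changing the text.

```python
import html


def repair(text: str, max_passes: int = 3) -> str:
    """Resolve HTML entities, then undo repeated cp1252 misreadings of UTF-8."""
    text = html.unescape(text)  # e.g. "&uuml;" becomes "ü"
    for _ in range(max_passes):
        try:
            candidate = text.encode("cp1252").decode("utf-8")
        except UnicodeError:
            # The text no longer forms valid UTF-8 after the round trip,
            # so it is assumed to be fully repaired (or not repairable this way).
            break
        if candidate == text:
            break  # pure ASCII or nothing left to fix
        text = candidate
    return text


print(repair("DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen"))
```

A string that mixes correct and damaged characters makes the round trip fail as a whole, so a real cleanup would probably need to apply this per field or per word rather than per file.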

$_POST will convert from utf-8 to Ã¤ Ã¶ Ã¼ etc

Question: I am new here, so I apologize if I am doing anything wrong. I have a form that submits user input to another page. The user is expected to type ä, ö, é, etc. I have placed all of the following in the document:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    header('Content-Type:text/html; charset=UTF-8');
    <form action="whatever.php" accept-charset="UTF-8">

I even tried:

    ini_set('default_charset', 'UTF-8');

When the other page loads, I need to check the user input with something like:

    if ( $_POST['field'] == $check ) { ... }

But if he inputs something like 'München',
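The question is cut off before the failure is described. One common cause of this kind of mismatch, assumed here for illustration, is that the submitted UTF-8 bytes are interpreted as Latin-1 somewhere before the comparison. The Python sketch below (not PHP) shows why such a comparison fails and how decoding consistently makes it succeed; the variable names are illustrative.

```python
check = "München"

# The browser sends the field as UTF-8 bytes.
submitted_bytes = check.encode("utf-8")

# Reading those bytes as Latin-1 turns "ü" into the two characters "Ã¼".
misread = submitted_bytes.decode("latin-1")

# Reading them as UTF-8 gives back the text the user typed.
decoded = submitted_bytes.decode("utf-8")

print(misread == check)  # prints: False
print(decoded == check)  # prints: True
```

In PHP terms, this would correspond to the script file, the form, and the comparison value all agreeing on UTF-8, though the excerpt does not confirm which of these was at fault.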