mojibake

Fixing mojibakes in UTF-8 text

冷暖自知 submitted on 2019-12-20 03:13:37
Question: I have a file with text in Portuguese in UTF-8. Somehow, whoever produced the file selected the wrong encoding, and the text is full of mojibake: IDENTIFICAÌàÌÄO instead of identificação, AndrÃ© instead of André. Automated tools do not see anything wrong with the file. I tried to fix it with the Python package ftfy, to no avail. How can I fix this file, apart from replacing all incorrect characters manually? Answer 1: "AndrÃ©" instead of "André" is the Latin-1 interpretation of UTF-8 encoding. You can fix it
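
The excerpt of Answer 1 is cut off before it shows the fix. A minimal Python sketch of the usual repair, assuming the text really is UTF-8 that was decoded as Latin-1 exactly once, is to reverse that step. The function name `fix_latin1_mojibake` is hypothetical, not taken from the answer.

```python
def fix_latin1_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Assumption: the damage happened exactly once. If the text does not
    round-trip cleanly, it is returned unchanged rather than guessed at.
    """
    try:
        # Re-create the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = "Andr\u00c3\u00a9"  # the mojibake form "AndrÃ©"
print(fix_latin1_mojibake(broken))
```

Text that is already correct fails the round trip and is returned as-is, so the helper is safe to apply line by line to a file that is only partly damaged.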

Python correct encoding of Website (Beautiful Soup)

我与影子孤独终老i submitted on 2019-12-17 23:33:47
Question: I am trying to load an HTML page and output the text. Even though I am getting the webpage correctly, BeautifulSoup somehow destroys the encoding. Source:

    # -*- coding: utf-8 -*-
    import requests
    from BeautifulSoup import BeautifulSoup
    url = "http://www.columbia.edu/~fdc/utf8/"
    r = requests.get(url)
    encodedText = r.text.encode("utf-8")
    soup = BeautifulSoup(encodedText)
    text = str(soup.findAll(text=True))
    print text.decode("utf-8")

Excerpt Output: ...Odenw\xc3\xa4lderisch... this should be

How to replace � in a string

柔情痞子 submitted on 2019-12-17 06:36:36
Question: I have a string that contains the character �, and I haven't been able to replace it correctly. String.replace("�", ""); doesn't work. Does anyone know how to remove or replace the � in the string? Answer 1: That's the Unicode Replacement Character, \uFFFD. Something like this should work: String strImport = "For some reason my �double quotes� were lost."; strImport = strImport.replaceAll("\uFFFD", "\""); Answer 2: Character issues like this are difficult to diagnose because information is easily
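
For comparison, here is a Python analogue of the Java snippet in Answer 1, offered as a sketch only. The helper name `replace_replacement_char` is hypothetical. Note that U+FFFD marks bytes that were already lost during decoding, so this only cleans up the output; it cannot recover the original characters.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the Unicode Replacement Character


def replace_replacement_char(text: str, substitute: str = "") -> str:
    """Replace every U+FFFD in text with the given substitute string."""
    return text.replace(REPLACEMENT_CHAR, substitute)


sample = "For some reason my \ufffddouble quotes\ufffd were lost."
print(replace_replacement_char(sample, '"'))
```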

Getting â€™ instead of an apostrophe (') in PHP

匆匆过客 submitted on 2019-12-17 04:26:10
Question: I've tried converting the text to or from UTF-8, which didn't seem to help. I'm getting: "Itâ€™s Getting the Best of Me". It should be: "It’s Getting the Best of Me". I'm getting this data from this URL. Answer 1: To convert to HTML entities: <?php echo mb_convert_encoding( file_get_contents('http://www.tvrage.com/quickinfo.php?show=Surviver&ep=20x02&exact=0'), "HTML-ENTITIES", "UTF-8" ); ?> See the docs for mb_convert_encoding for more encoding options. Answer 2: Make sure your HTML header specifies UTF-8
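
To see where the three-character sequence comes from, the following Python sketch reproduces the damage and then reverses it. It assumes the misreading was Windows-1252 (cp1252), which is what turns the UTF-8 bytes of a right single quotation mark (U+2019) into "â€™". The variable names are illustrative only.

```python
good = "It\u2019s Getting the Best of Me"  # U+2019 is the curly apostrophe

# Reproduce the mojibake: UTF-8 bytes E2 80 99 read as cp1252 characters.
broken = good.encode("utf-8").decode("cp1252")
print(broken)

# Reverse the mistake: recover the original bytes, then decode as UTF-8.
repaired = broken.encode("cp1252").decode("utf-8")
print(repaired)
```

This only demonstrates the mechanism. In the PHP setting of the question, the more durable fix is the one the answers point toward: declare and use UTF-8 consistently so the bytes are never misread in the first place.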

Facebook JSON badly encoded

☆樱花仙子☆ submitted on 2019-12-17 02:37:49
Question: I downloaded my Facebook Messenger data (in your Facebook account, go to Settings, then Your Facebook information, then Download your information, then create a file with at least the Messages box checked) to do some cool statistics. However, there is a small problem with encoding. I'm not sure, but it looks like Facebook used bad encoding for this data. When I open it with a text editor I see something like this: Rados\u00c5\u0082aw. When I try to open it with Python (UTF-8) I get RadosÅ
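
The excerpt ends before any answer. One commonly used repair, sketched here under the assumption that each \u00XX escape in the export is really a single UTF-8 byte, is to parse the JSON normally and then re-encode every string as Latin-1 and decode it as UTF-8. The key `sender_name` and the helper `repair` are hypothetical names used for illustration.

```python
import json


def repair(value):
    """Recursively fix strings whose UTF-8 bytes were stored as code points."""
    if isinstance(value, str):
        # Each character is in U+0000..U+00FF, so Latin-1 restores the bytes.
        return value.encode("latin-1").decode("utf-8")
    if isinstance(value, list):
        return [repair(item) for item in value]
    if isinstance(value, dict):
        return {repair(key): repair(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through untouched


# Hypothetical one-field sample in the shape described in the question.
raw = '{"sender_name": "Rados\\u00c5\\u0082aw"}'
data = repair(json.loads(raw))
print(data["sender_name"])
```

The sketch raises an error if a string contains characters outside Latin-1 or bytes that are not valid UTF-8, which is a useful signal that the assumption does not hold for that file.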

$_POST will convert from utf-8 to ä ö ü etc

不想你离开。 submitted on 2019-12-09 14:32:43
Question: I am new here, so I apologize if I am doing anything wrong. I have a form which submits user input to another page. The user is expected to type ä, ö, é, etc. I have placed all of the following in the document: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> header('Content-Type:text/html; charset=UTF-8'); <form action="whatever.php" accept-charset="UTF-8"> I even tried: ini_set('default_charset', 'UTF-8'); When the other page loads, I need to check what the user input

python replace unicode characters

纵饮孤独 submitted on 2019-12-07 06:19:58
Question: I wrote a program to read a Windows DNS debugging log, but there are always some funny characters in the domain field. Below is one example: (13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)' I want to replace every \x.. with a ?. Explicitly typing \xc2 as follows works: line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)' re.sub('\\\xc2', '?', line) result: '(13)?\xb5?\xb1?\xbe\xc3\xa2p\xc3\xb4?\x8d(5)example(3)com(0)' But it's not

python replace unicode characters

主宰稳场 submitted on 2019-12-05 11:06:27
I wrote a program to read a Windows DNS debugging log, but there are always some funny characters in the domain field. Below is one example: (13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)' I want to replace every \x.. with a ?. Explicitly typing \xc2 as follows works: line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)' re.sub('\\\xc2', '?', line) result: '(13)?\xb5?\xb1?\xbe\xc3\xa2p\xc3\xb4?\x8d(5)example(3)com(0)' But it's not working if I write it as follows: re.sub('\\\x..', '?', line) How can I write a regular expression to
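
The \x.. sequences in the question are how Python displays single non-ASCII bytes; they are not literal backslash text, which is likely why a pattern such as '\\\x..' does not match. A minimal sketch, assuming Python 3 and that the log line is held as bytes, matches the byte range directly. The names `NON_ASCII` and `mask_non_ascii` are hypothetical.

```python
import re

# Any single byte outside the 7-bit ASCII range.
NON_ASCII = re.compile(rb"[\x80-\xff]")


def mask_non_ascii(raw: bytes) -> bytes:
    """Replace every non-ASCII byte in raw with a question mark."""
    return NON_ASCII.sub(b"?", raw)


line = (
    b"(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d"
    b"(5)example(3)com(0)"
)
print(mask_non_ascii(line))
```

Each byte is replaced individually, so a two-byte UTF-8 sequence becomes two question marks. If one mark per character is preferred, decoding the bytes to text first and matching non-ASCII characters would be the alternative.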

Converting special characters such as ü and à back to their original, Latin alphabet counterparts in C#

十年热恋 submitted on 2019-12-04 09:07:30
Question: I have been given an export from a MySQL database that seems to have had its encoding muddled somewhat over time. It contains a mix of HTML character codes such as &uuml; and more problematic characters representing the same letters, such as ü and à. It is my task to bring some consistency back to the file and get everything into the correct Latin characters, e.g. ú and ó. An example of the sort of string I am dealing with is: Desinfektionslösungstücher für Flächen Which should

$_POST will convert from utf-8 to ä ö ü etc

ⅰ亾dé卋堺 submitted on 2019-12-04 00:11:19
I am new here, so I apologize if I am doing anything wrong. I have a form which submits user input to another page. The user is expected to type ä, ö, é, etc. I have placed all of the following in the document: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> header('Content-Type:text/html; charset=UTF-8'); <form action="whatever.php" accept-charset="UTF-8"> I even tried: ini_set('default_charset', 'UTF-8'); When the other page loads, I need to check what the user input with something like: if ( $_POST['field'] == $check ) { ... } But if he inputs something like 'München',