character-encoding | 易学教程

How to convert “binary text” to “visible text”?

Question: I have a text file full of non-ASCII characters. Neither file nor enca can detect its encoding:

    $ file non_ascii.txt
    non_ascii.txt: Non-ISO extended-ASCII text
    $ enca non_ascii.txt
    Unrecognized encoding

But I can open it normally in Windows Notepad++. Edit: The statement above is misleading; sorry about that. In fact, I copied some parts of the original file into a new text file and then opened that file in Notepad++. The two parts are shown below. They are decoded in two different ways
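When different parts of a file use different encodings, one possible approach is to decode the data piece by piece and try a list of candidate encodings on each piece. The Python sketch below is only an illustration: the helper name `decode_mixed` and the candidate list are assumptions, not taken from the question, and the first encoding that decodes without an error is not guaranteed to be the correct one.

```python
def decode_mixed(data, candidates=("utf-8", "gb18030", "cp1252")):
    """Decode bytes line by line, trying each candidate encoding in order.

    The candidate list is a hypothetical example; adjust it to the
    encodings you actually suspect.
    """
    results = []
    for raw_line in data.splitlines():
        for enc in candidates:
            try:
                results.append((enc, raw_line.decode(enc)))
                break
            except UnicodeDecodeError:
                continue
        else:
            # No candidate fit: keep the line visible with replacement characters.
            results.append((None, raw_line.decode("utf-8", errors="replace")))
    return results


# Made-up sample: one UTF-8 line followed by one GB18030 line.
sample = "héllo".encode("utf-8") + b"\n" + "中文".encode("gb18030")
decoded = decode_mixed(sample)
for enc, text in decoded:
    print(enc, ascii(text))
```

The order of the candidates matters: permissive single-byte encodings accept almost any input, so stricter encodings such as UTF-8 should come first.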

unable to print euro symbol in a “C” program

Question: I am unable to print the euro symbol. The program I am using is below. I have set the character set to code page 1250, in which 0x80 stands for the euro symbol.

Program
=======
    #include <stdio.h>
    #include <locale.h>

    int main()
    {
        printf("Current locale is: %s\n", setlocale(LC_ALL, ".1250"));
        printf("Euro character: %c\n", 0x80);
        getchar();
        return 0;
    }

Output
======
    Current locale is: English_India.1250
    Euro character: ?

Other details
=============
    OS: Windows Vista
    Compiler: vc++ 2008
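The premise of the question can be checked with a small sketch, written here in Python rather than C. Byte 0x80 does map to the euro sign in code page 1250, but the same byte is a different character in an OEM code page such as 437. The question does not show the console settings, so it is only an assumption that the console's output code page, rather than the C locale, decides what is displayed.

```python
EURO_BYTE = bytes([0x80])

# In Windows-1250 the byte 0x80 is the euro sign (U+20AC).
as_cp1250 = EURO_BYTE.decode("cp1250")

# In OEM code page 437 the same byte is a different character.
as_cp437 = EURO_BYTE.decode("cp437")

# ascii() keeps the output printable on any terminal encoding.
print("cp1250:", ascii(as_cp1250))
print("cp437: ", ascii(as_cp437))
```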

Simplest way to get rid of zero-width-space in c# string

Question: I am parsing emails using a regex in a C# VSTO project. Once in a while, the regex does not seem to work (although if I paste the text and the regex into RegexBuddy, the regex correctly matches the text). If I look at the email in Gmail, I see =E2=80=8B at the beginning and end of some lines (which I understand is the UTF-8 zero-width space); this appears to be what is messing up the regex. It seems to be the only such sequence showing up. What is the easiest way to get rid of this exact sequence? I cannot
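=E2=80=8B is the quoted-printable form of the three UTF-8 bytes that encode U+200B (zero-width space). The question is about C#; the Python sketch below only illustrates the idea, which is to decode the text first and then remove the single character U+200B. The sample input is made up.

```python
import quopri

# Hypothetical sample line as it appears in the raw message source.
raw = b"=E2=80=8Bfoo@example.com=E2=80=8B"

# Undo the quoted-printable transfer encoding, then decode the UTF-8 bytes.
text = quopri.decodestring(raw).decode("utf-8")

# Remove every zero-width space (U+200B) before applying a regex.
cleaned = text.replace("\u200b", "")

print(ascii(text))
print(ascii(cleaned))
```

Once the string is decoded, the same removal in C# would be a plain string replacement of "\u200B" with an empty string.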

C# partial UTF-8 byte stream conversion

Question: I have written the following simple test:

    [Test]
    public void TestUTF8()
    {
        var c = "abc☰def";
        var b = Encoding.UTF8.GetBytes(c);
        Assert.That(b.Length, Is.EqualTo(9));

        // Assume you are reading a byte stream and got a partial result with the first 5 bytes
        var p = Encoding.UTF8.GetChars(b, 0, 5);
        Trace.WriteLine(new string(p));
        Assert.That(p.Length, Is.EqualTo(3));
    }

The Trace outputs abc� and the last assert fails because p.Length is 4. However, I wanted the Trace to output abc and the last assert
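The behaviour the question asks for, holding back an incomplete multi-byte sequence until the rest arrives, is what a stateful (incremental) decoder provides. The sketch below shows the idea in Python with `codecs.getincrementaldecoder`; it is an illustration of the technique, not the C# answer. In .NET, a stateful `Decoder` obtained from `Encoding.UTF8.GetDecoder()` plays a similar role.

```python
import codecs

data = "abc☰def".encode("utf-8")  # 9 bytes: "☰" (U+2630) takes 3 bytes

decoder = codecs.getincrementaldecoder("utf-8")()

# The first chunk ends in the middle of "☰"; the decoder buffers the
# incomplete bytes instead of emitting U+FFFD.
first = decoder.decode(data[:5])

# The remaining bytes complete the buffered character.
second = decoder.decode(data[5:], final=True)

print(ascii(first))
print(ascii(second))
```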

Converting from bytes to French text in Python

Question: I am cleaning the monolingual French Europarl corpus (http://data.statmt.org/wmt19/translation-task/fr-de/monolingual/europarl-v7.fr.gz). The original raw data is in a .gz file (which I downloaded using wget). I want to extract the text and see what it looks like in order to process the corpus further. Using the following code to extract the text from the gzip file, I obtained data of class bytes.

    with gzip.open(file_path, 'rb') as f_in:
        print('type(f_in)=', type(f_in))
        text = f_in.read()
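Reading a gzip file in binary mode always yields bytes; to get text, the bytes have to be decoded, either explicitly or by opening the file in text mode. The sketch below assumes the data is UTF-8, which the question does not state, and writes a tiny made-up sample file so that it runs on its own.

```python
import gzip
import os
import tempfile

# Create a small gzip file so the example is self-contained (made-up text).
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, "sample.fr.gz")
with gzip.open(file_path, "wb") as f_out:
    f_out.write("Reprise de la session\nJe déclare reprise la session\n".encode("utf-8"))

# Option 1: read bytes, then decode explicitly (UTF-8 is an assumption).
with gzip.open(file_path, "rb") as f_in:
    text_from_bytes = f_in.read().decode("utf-8")

# Option 2: open in text mode so the file object decodes while reading.
with gzip.open(file_path, "rt", encoding="utf-8") as f_in:
    lines = [line.rstrip("\n") for line in f_in]

print(type(text_from_bytes).__name__, len(lines))
```

Text mode is usually the more convenient choice for a large corpus, because the file can be iterated line by line without loading everything into memory.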

how randomForest package in R interprets character variables

Question: This post is related to: How R automatically coerces character input to numeric? I am a user of the randomForest package and have a quick question: can anyone tell me, or point me to the place in the source code that shows, how the randomForest package in R treats character variables? I have used character variables as direct input, and I have also converted the character variables to factors as input, but the performance differs. Hope for a quick answer or a reference to

UnicodeDecodeError 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

Question: I'm trying to open a series of HTML files in order to get the text from the body of those files using BeautifulSoup. I have about 435 files that I want to run through, but I keep getting this error. I've tried converting the HTML files to text and opening the text files, but I get the same error...

    path = "./Bitcoin"
    for file in os.listdir(path):
        with open(os.path.join(path, file), "r") as fname:
            txt = fname.read()

I want to get the source code of the HTML file so I can parse it using
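Byte 0x92 cannot start a UTF-8 sequence, but in Windows-1252 it is the right single quotation mark (U+2019), which often appears in text saved on Windows. The sketch below assumes the files are Windows-1252, which the question does not confirm; the helper name `read_html` and the sample content are made up for illustration.

```python
import os
import tempfile

# Made-up sample containing byte 0x92, written to a temporary file.
raw = b"<html><body>Bitcoin\x92s price</body></html>"
tmp_dir = tempfile.mkdtemp()
html_path = os.path.join(tmp_dir, "page.html")
with open(html_path, "wb") as f:
    f.write(raw)


def read_html(path, primary="utf-8", fallback="cp1252"):
    """Read a file as text, retrying with a fallback encoding on failure.

    Both encoding names are assumptions about the input files.
    """
    try:
        with open(path, "r", encoding=primary) as f:
            return f.read()
    except UnicodeDecodeError:
        with open(path, "r", encoding=fallback) as f:
            return f.read()


txt = read_html(html_path)
print(ascii(txt))
```

If the true encoding is unknown, passing `errors="replace"` to `open` is another option; it never raises but turns undecodable bytes into U+FFFD.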