cyrillic

Russian symbols in Python output corrupted (ENCODING)

我的梦境 submitted on 2021-02-07 19:40:19
Question: I parsed an HTML document that has Russian text in it. When I try to print it in Python, I get this: ÐлÑбниÑнÑй новогодний пÑÐ½Ñ I tried to decode it and I get ISO-8859-1 encoding. I'm trying to decode it like this:
    print drink_name.decode('iso8859-1')
But I get an error. How can I print this text, or encode it in Unicode?
Answer 1: You have a Mojibake; UTF-8 bytes decoded as Latin-1 or CP1251 in this case. You can repair it by reversing the process:
    >>> print u'ÐлÑбнÐ
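The repair the answer describes (undo the wrong decode, then decode correctly) can be sketched in Python 3 as below. This is only an illustration under assumptions: the helper name fix_mojibake and the sample word are hypothetical, and the text is assumed to be UTF-8 that was mis-decoded as Latin-1.

```python
def fix_mojibake(text: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Re-encode with the codec that was wrongly assumed, then decode with the real one."""
    return text.encode(wrong).decode(right)


# Build a garbled sample on purpose: UTF-8 bytes read as Latin-1 (assumed scenario).
original = "Привет"
garbled = original.encode("utf-8").decode("latin-1")

# Reverse the process to recover the original text.
repaired = fix_mojibake(garbled)
print(repaired)
```

If the garbled text has already lost bytes (for example, control characters stripped during copy and paste), the round trip can fail with a UnicodeDecodeError, so this only works on undamaged input.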

Handle Turkish uppercase and lowercase correctly, need to modify/override built-in functions?

谁说胖子不能爱 submitted on 2021-02-07 11:46:11
Question: I am working with multilingual text data, including Russian (which uses the Cyrillic alphabet) and Turkish. I have to compare the words in two files, my_file and check_file, and if a word in my_file can be found in check_file, write it to an output file, keeping the meta-information about it from both input files. Some words are lowercased while others are capitalised, so I have to lowercase all the words to compare them. As I use Python 3.6.5 and Python 3 uses
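One common workaround for the Turkish casing problem, sketched here under assumptions, is to map the dotted and dotless I pairs by hand before calling the built-in methods. Python's str.lower() and str.upper() are not locale-aware, so 'I' becomes 'i' rather than 'ı'. The helper names turkish_lower and turkish_upper are hypothetical, and the sketch assumes you already know which words are Turkish.

```python
def turkish_lower(text: str) -> str:
    """Lowercase using Turkish rules: İ -> i and I -> ı, then the default mapping."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def turkish_upper(text: str) -> str:
    """Uppercase using Turkish rules: i -> İ and ı -> I, then the default mapping."""
    return text.replace("i", "İ").replace("ı", "I").upper()


# The built-in mapping is language-neutral, which is wrong for Turkish words.
print("ISPARTA".lower())         # default rules
print(turkish_lower("ISPARTA"))  # Turkish rules
print(turkish_upper("istanbul"))
```

Applying these helpers to non-Turkish words would be wrong (English "I" should still lowercase to "i"), so the language of each word has to be known before choosing the mapping.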

Regular expression to match russian, allow all cyrillic characters in .htaccess

坚强是说给别人听的谎言 submitted on 2020-01-24 05:30:05
Question: How do I redirect a URL with a Russian slug to a specific PHP page? For example, I have this URL: http://www.example.com/основной-момент.htm and want to redirect it, in .htaccess, to this one: http://www.example.com/category.php?slug=<russian slug>
Answer 1: If it's allowed on your server, you could try something like this for the specific page in your .htaccess file:
    RewriteEngine On
    RewriteRule ^основной-момент\.htm$ category.php?slug=ru
In a regex, if the character set is enabled on your server, you should be

mb_convert_encoding for russian in php

本小妞迷上赌 submitted on 2020-01-12 19:02:08
Question: How do I convert Russian characters to UTF-8 in PHP using mb_convert_encoding or any other method?
Answer 1: Did you try the following? Not sure if it works, though.
    mb_convert_encoding($str, 'UTF-8', 'auto');
Answer 2:
    $file = 'images/да так 1.jpg'; // this is in UTF-8, needs to be system encoding (Russian)
    $new_filename = mb_convert_encoding($file, "Windows-1251", "utf-8"); // turn UTF-8 into the system encoding, Windows-1251 (Russian)
Now your Russian files should open. Your Russian characters in PHP are already
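For comparison, the same byte-level conversion can be sketched in Python rather than PHP: decode the bytes from the source encoding, then encode them to the target encoding. The helper name convert_encoding is hypothetical, and the sample file name is reduced to the Cyrillic part of the one in Answer 2.

```python
def convert_encoding(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target encoding."""
    return data.decode(src).encode(dst)


# The file name as UTF-8 bytes (two bytes per Cyrillic letter).
utf8_bytes = "да так 1.jpg".encode("utf-8")

# Convert to Windows-1251 (one byte per Cyrillic letter), then back again.
cp1251_bytes = convert_encoding(utf8_bytes, "utf-8", "cp1251")
round_trip = convert_encoding(cp1251_bytes, "cp1251", "utf-8")

print(len(utf8_bytes), len(cp1251_bytes))
```

The conversion only works when the source encoding is named correctly; guessing it (as 'auto' does in Answer 1) can silently pick the wrong one.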

How to read Cyrillic Unicode file in C++?

拈花ヽ惹草 submitted on 2020-01-02 04:12:10
Question: I'm trying to read lines from .txt files that have been saved as Unicode. This is how I'm doing it:
    wifstream input;
    string path = "test.txt";
    input.imbue(locale(input.getloc(), new codecvt_utf16<wchar_t, 0x10ffff, consume_header>));
    input.open(path);
    if (input.is_open()) {
        wstring line;
        input.seekg( 1 , ios_base::beg);
        getline(input, line);
    }
It works fine for files with Latin characters. But for Cyrillic files I get weird symbols instead of spaces and adjacent characters. For example: What
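As a cross-check outside C++, a file saved as "Unicode" (assumed here to mean UTF-16 with a byte-order mark) can be read in Python by letting the codec consume the BOM, with no manual seeking. This is a sketch in a different language from the question, not a fix for the C++ code; the helper name read_first_line and the sample text are hypothetical.

```python
import os
import tempfile


def read_first_line(path: str) -> str:
    """Read the first line of a UTF-16 file; the 'utf-16' codec consumes the BOM itself."""
    with open(path, encoding="utf-16") as handle:
        return handle.readline().rstrip("\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    # Write a small Cyrillic sample; the 'utf-16' codec prepends a BOM on write.
    with open(path, "w", encoding="utf-16") as handle:
        handle.write("Привет мир\nвторая строка\n")
    first = read_first_line(path)

print(first)
```

If the first line read this way matches what a text editor shows, the file itself is fine and the problem lies in how the reading program handles byte order or stream offsets.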

Python — check if a string contains Cyrillic characters

依然范特西╮ submitted on 2019-12-30 04:21:24
Question: How do I check whether a string contains Cyrillic characters? E.g.
    >>> has_cyrillic('Hello, world!')
    False
    >>> has_cyrillic('Привет, world!')
    True
Answer 1: The regex module supports Unicode properties, along with a few short forms.
    >>> regex.search(r'\p{IsCyrillic}', 'Hello, world!')
    >>> regex.search(r'\p{IsCyrillic}', 'Привет, world!')
    <regex.Match object; span=(0, 1), match='П'>
    >>> regex.search(r'\p{IsCyrillic}', 'Hello, wёrld!')
    <regex.Match object; span=(8, 9), match='ё'>
Answer 2: You can use a regular
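If the third-party regex module is not available, one standard-library way to get the has_cyrillic behaviour from the question is a character range with the built-in re module. This is a sketch under an assumption: it covers only the basic Cyrillic block (U+0400 to U+04FF), not the Cyrillic Supplement or extended blocks, and it is not necessarily what the truncated Answer 2 goes on to propose.

```python
import re

# Basic Cyrillic block only (assumption: supplement/extended blocks are not needed).
_CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def has_cyrillic(text: str) -> bool:
    """Return True if the text contains at least one character from U+0400..U+04FF."""
    return bool(_CYRILLIC.search(text))


print(has_cyrillic("Hello, world!"))
print(has_cyrillic("Привет, world!"))
print(has_cyrillic("Hello, wёrld!"))
```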

Unable to print russian characters

纵然是瞬间 submitted on 2019-12-28 06:32:08
Question: I have a Russian string which I have encoded to UTF-8:
    String str = "\u041E\u041A";
    System.out.println("String str : " + str);
When I print the string in the Eclipse console I get ?? Can anyone suggest how to print Russian strings to the console, or what I am doing wrong here? I have tried converting it to bytes using byte myArr[] = str.getBytes("UTF-8") and then new String(myArr, "UTF-8"), but I still have the same problem :-(
Answer 1: Try this:
    String myString = "some cyrillic text";
    byte bytes[] = myString.getBytes

getBytes() doesn't work for Cyrillic letters

爷,独闯天下 submitted on 2019-12-25 16:19:14
Question: I found some answers but none of them works for me. I want to make a PDF file from an HTML file, but the problem is that my HTML has Cyrillic letters, and I found that it has something to do with this simple code:
    String s = "Здраво Kris";
    byte bytes[] = s.getBytes("UTF-8");
    String value = new String(bytes, "ISO-8859-1"); // I tried with new String(bytes, "UTF-8") but it didn't work
Then I pass the value to my PDF generator function, but it outputs only the part of the string s that is not in