unicode-string

Return code point of characters in C#

孤街浪徒 submitted on 2019-11-30 07:55:00
Question: How can I return the Unicode code point of a character? For example, if the input is "A", the output should be "U+0041". Ideally, a solution should handle surrogate pairs. By code point I mean the actual code point according to Unicode, which is different from a code unit (UTF-8 has 8-bit code units, UTF-16 has 16-bit code units, and UTF-32 has 32-bit code units; in the last case the value equals the code point, after taking endianness into account).
Answer 1: Easy, since chars in C#
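
The distinction between code units and code points drawn in the question can be sketched in Python (used here for illustration only; the thread itself concerns C#). The helper names `utf16_units` and `code_points_from_utf16` are hypothetical. The sketch assumes its input is a sequence of UTF-16 code units and labels an unpaired surrogate with its own value.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_points_from_utf16(units):
    """Yield 'U+XXXX' labels, combining surrogate pairs into one code point."""
    units = list(units)
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character (or an unpaired surrogate, reported unchanged).
            cp = unit
            i += 1
        yield f"U+{cp:04X}"
```

For example, `list(code_points_from_utf16(utf16_units("A")))` gives `["U+0041"]`, matching the output format requested in the question.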

Automatically change between std::string and std::wstring according to unicode setting in MSVC++?

你。 submitted on 2019-11-30 04:45:41
Question: I'm writing a DLL and want to be able to switch between the Unicode and multibyte settings in MSVC++ 2010. For example, I use _T("string"), LPCTSTR, and WIN32_FIND_DATA instead of the -W and -A versions, and so on. Now I want strings that switch between std::string and std::wstring according to the Unicode setting. Is that possible? Otherwise, this will probably end up getting extremely complicated.
Answer 1: Why not do as the Win32 API does: use wide characters internally, and

Perl: printing Unicode strings to the Windows console

梦想与她 submitted on 2019-11-30 03:34:37
Question: I am encountering a strange problem when printing Unicode strings to the Windows console. Consider this text: אני רוצה לישון Intermediary היא רוצה לישון אתם, הם Bye Hello, world! test Assume it's in a file called "file.txt". When I run "type file.txt", it prints fine. But when it's printed from a Perl program, like this: use strict; use warnings; use Encode; use 5.014; use utf8; use autodie; use warnings qw< FATAL utf8 >; use open qw< :std :utf8 >; use feature qw< unicode_strings >; use

How do I use 3 and 4-byte Unicode characters with standard C++ strings?

南笙酒味 submitted on 2019-11-30 02:16:00
In standard C++ we have char and wchar_t for storing characters. char can store values between 0x00 and 0xFF, and wchar_t can store values between 0x0000 and 0xFFFF. std::string uses char, so it can store 1-byte characters only. std::wstring uses wchar_t, so it can store characters up to 2 bytes wide. This is what I know about strings in C++. Please correct me if I said anything wrong up to this point. I read the Wikipedia article on UTF-8 and learned that some Unicode characters take up to 4 bytes. For example, the Chinese character 𤭢 has the Unicode code point 0x24B62,
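
To make the byte counts concrete, here is a small Python sketch (illustration only; the thread is about C++) that shows how the code point 0x24B62 from the question is laid out in UTF-8, UTF-16, and UTF-32. The helper name `encoded_hex` is hypothetical.

```python
def encoded_hex(text, encoding):
    """Return the encoded bytes of text as space-separated uppercase hex."""
    return " ".join(f"{byte:02X}" for byte in text.encode(encoding))


ch = "\U00024B62"  # the character cited in the question

utf8_form = encoded_hex(ch, "utf-8")       # four 8-bit code units
utf16_form = encoded_hex(ch, "utf-16-be")  # two 16-bit code units (a surrogate pair)
utf32_form = encoded_hex(ch, "utf-32-be")  # one 32-bit code unit
```

The point of the sketch is that a single code point can span several code units, so a string of 8-bit units can still carry a 4-byte UTF-8 sequence; how many characters a string holds depends on the encoding, not only on the width of its element type.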

PHP - length of string containing emojis/special chars

坚强是说给别人听的谎言 submitted on 2019-11-29 19:08:13
Question: I'm building an API for a mobile application, and I seem to have a problem counting the length of a string that contains emojis. My code: $str = "👍🏿✌🏿️ @mention"; printf("strlen: %d" . PHP_EOL, strlen($str)); printf("mb_strlen UTF-8: %d" . PHP_EOL, mb_strlen($str, "UTF-8")); printf("mb_strlen UTF-16: %d" . PHP_EOL, mb_strlen($str, "UTF-16")); printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("UTF-8", "UTF-16", $str))); printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("ISO-8859-1
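
The different numbers reported by the length functions in the question come from counting different units. A minimal Python sketch (illustration only; the thread uses PHP) counts the same text three ways. The helper `length_report` and the two-code-point sample are assumptions, not taken from the thread.

```python
def length_report(text):
    """Count a string in code points, UTF-8 bytes, and UTF-16 code units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # utf-16-le writes no byte order mark, so two bytes equal one code unit.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


# Thumbs-up sign followed by a dark skin tone modifier: one visible emoji.
sample = "\U0001F44D\U0001F3FF"
report = length_report(sample)
```

None of the three counts equals the single emoji a user sees, which is why a length limit for an API has to state which unit it measures.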

Python 3: os.walk() file paths UnicodeEncodeError: 'utf-8' codec can't encode: surrogates not allowed

只谈情不闲聊 submitted on 2019-11-28 21:26:32
This code: for root, dirs, files in os.walk('.'): print(root) gives me this error: UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 27: surrogates not allowed How do I walk through a file tree without getting toxic strings like this? On Linux, filenames are 'just a bunch of bytes' and are not necessarily encoded in any particular encoding. Python 3 tries to turn everything into Unicode strings. In doing so, the developers came up with a scheme to translate byte strings to Unicode strings and back without loss, and without knowing the original encoding. They used
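
One way to print such paths safely can be sketched as follows. It relies on Python's `surrogateescape` error handler, which maps each undecodable byte to a lone surrogate in the range U+DC80 to U+DCFF; that is where '\udcc3' in the error message comes from. The helper names `printable` and `walk_printable` are hypothetical, and the sketch assumes the file names are meant to be UTF-8.

```python
import os


def printable(path):
    """Return a copy of path that can always be encoded as UTF-8."""
    # Recover the original bytes from the escaped surrogates...
    raw = path.encode("utf-8", errors="surrogateescape")
    # ...then show undecodable bytes as visible \xNN escapes.
    return raw.decode("utf-8", errors="backslashreplace")


def walk_printable(top):
    """Yield each directory under top in a form that is safe to print."""
    for root, dirs, files in os.walk(top):
        yield printable(root)
```

The original `root` value should still be used for file operations; only the printed form is altered here.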

Java Unicode String length

谁说我不能喝 submitted on 2019-11-28 17:10:02
I am trying to get the length of a Unicode string and have tried various options. It looks like a small problem, but I am badly stuck. Here I am trying to get the length of the string str1. I get 6, but it is actually 3. Moving the cursor over the string "குமார்" also shows it as 3 characters. Basically, I want to measure the length and print each character, like "கு", "மா", "ர்". public class one { public static void main(String[] args) { String str1 = new String("குமார்"); System.out.print(str1.length()); } } PS: It is the Tamil language.
halex: Found a solution to your problem. Based on
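
The gap between 6 and 3 arises because each visible Tamil character here is a base letter plus a combining mark. A rough Python sketch (illustration only; the thread uses Java) groups every combining mark with the preceding character. This approximates user-perceived characters and is not the full Unicode grapheme cluster algorithm; the helper name `clusters` is hypothetical.

```python
import unicodedata


def clusters(text):
    """Group each combining mark (general category M*) with the preceding character."""
    groups = []
    for ch in text:
        if groups and unicodedata.category(ch).startswith("M"):
            groups[-1] += ch  # attach the mark to its base character
        else:
            groups.append(ch)  # start a new cluster
    return groups


# The string from the question, written with escapes: ka+u, ma+aa, ra+virama.
name = "\u0b95\u0bc1\u0bae\u0bbe\u0bb0\u0bcd"
parts = clusters(name)
```

Here `len(name)` counts six code points, while `len(parts)` counts the three clusters the asker expects.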

PDO and UTF-8 special characters in PHP / MySQL?

ⅰ亾dé卋堺 submitted on 2019-11-28 13:19:53
I am using MySQL and PHP 5.3 and tried this code. $dbhost = 'localhost'; $dbuser = 'root'; $dbpass = ''; $con = mysql_connect("localhost", "root", ""); mysql_set_charset('utf8'); if (!$con) { die('Could not connect: ' . mysql_error()); } mysql_select_db("kdict", $con); $sql = "SELECT * FROM `en-kh` where english='a'"; echo $sql; $result = mysql_query($sql); while($row = mysql_fetch_array($result)) { echo $row['english'] . " </br> " . $row['khmer']; echo "<br />"; } ?> => This rendered the UTF-8 text correctly. But now I have created a PDO class to make the code easier to extend, with a simpler class crud

Java: How to create unicode from string “\u00C3” etc

心不动则不痛 submitted on 2019-11-28 12:17:15
I have a file that has strings hand-typed as \u00C3. I want to create the Unicode character represented by that escape in Java. I tried but could not find out how. Help. Edit: When I read the text file, the String will contain "\u00C3" not as Unicode but as the ASCII chars '\' 'u' '0' '0' 'C' '3'. I would like to form a Unicode character from that ASCII string. I picked this up somewhere on the web: String unescape(String s) { int i=0, len=s.length(); char c; StringBuffer sb = new StringBuffer(len); while (i < len) { c = s.charAt(i++); if (c == '\\') { if (i < len) { c = s.charAt(i++); if (c == 'u')
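
The idea behind the truncated unescape routine can be sketched compactly in Python (illustration only; the thread uses Java). The sketch assumes every escape has exactly four hexadecimal digits after a backslash and a lowercase u. It does not treat a doubled backslash specially and does not join two escapes that form a surrogate pair. The name `unescape` mirrors the question's method, but this is not that method.

```python
import re

# A literal backslash, 'u', then exactly four hexadecimal digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)
```

Calling `unescape` on the six ASCII characters backslash, u, 0, 0, C, 3 yields the single character U+00C3.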

Unicode Conversion in c#

对着背影说爱祢 submitted on 2019-11-28 07:46:31
Question: I am trying to assign Unicode text to a string, but it returns the string "Привет" as "Привет". But I need "Привет". I am converting with the following function: public string Convert(string str) { byte[] utf8Bytes = Encoding.UTF8.GetBytes(str); str = Encoding.UTF8.GetString(utf8Bytes); return str; } What can I do to solve this problem and return "Привет"?
Answer 1: П is Unicode character 0x041F, and its UTF-8 encoding is 0xD0 0x9F, resulting in П. Since the function only returns the input parameter, as
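
The answer's point that encoding to UTF-8 and decoding again returns the input unchanged can be checked with a short Python sketch (illustration only; the thread uses C#). The sketch also shows how garbling typically arises when UTF-8 bytes are decoded with a single-byte code page. The choice of cp1252 as that code page is an assumption, and the helper names are hypothetical.

```python
def roundtrip(text):
    """Encode to UTF-8 and decode again; the result equals the input."""
    return text.encode("utf-8").decode("utf-8")


def misdecode(text, wrong="cp1252"):
    """Simulate garbling: read UTF-8 bytes as a single-byte code page."""
    return text.encode("utf-8").decode(wrong)


def repair(garbled, wrong="cp1252"):
    """Undo misdecode, assuming the same wrong code page was used."""
    return garbled.encode(wrong).decode("utf-8")


word = "\u041f\u0440\u0438\u0432\u0435\u0442"  # the Russian word from the question
first_letter_bytes = "\u041f".encode("utf-8")  # the two bytes named in the answer
```

Under these assumptions, the fix belongs wherever the bytes are first decoded with the wrong encoding, not in a function that round-trips an already decoded string.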