utf-16

Does Unicode have a defined maximum number of code points?

徘徊边缘 submitted on 2019-11-27 22:18:10
Question: I have read many articles trying to find the maximum number of Unicode code points, but I never found a definitive answer. I understand that the code point range was capped so that UTF-8, UTF-16, and UTF-32 can all encode the same set of code points. But what is that number? The most frequent answer I have encountered is that Unicode code points lie in the range 0x000000 to 0x10FFFF (1,114,112 code points), but I have also read in other…
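The cap the question describes can be checked directly; a minimal Python sketch (Python's `chr()` happens to enforce exactly the Unicode code space, which makes it a convenient witness):

```python
# Unicode fixes the code space at U+0000..U+10FFFF: 17 planes of
# 65,536 code points each, i.e. 0x110000 = 1,114,112 values in total.
MAX_CODE_POINT = 0x10FFFF
total = MAX_CODE_POINT + 1

print(total == 1_114_112)      # True
print(total == 17 * 65536)     # True: 17 planes of 64K code points

# chr() accepts exactly this range; one past the end raises ValueError.
assert ord(chr(MAX_CODE_POINT)) == MAX_CODE_POINT
try:
    chr(MAX_CODE_POINT + 1)
except ValueError:
    print("0x110000 is out of range")
```

Note that 1,114,112 counts the whole code space; the surrogate range U+D800..U+DFFF is reserved for UTF-16 and can never be assigned to characters.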

Emoji value range

笑着哭i submitted on 2019-11-27 22:17:05
I was trying to strip all emoji characters out of a string (like a sanitizer), but I cannot find a complete set of emoji values. What is the complete set of emoji characters' UTF-16 values? The Unicode standard's Unicode® Technical Report #51 includes a list of emoji (emoji-data.txt):

    ...
    21A9 ; text  ; L1 ; none ; j # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
    21AA ; text  ; L1 ; none ; j # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
    231A ; emoji ; L1 ; none ; j # V1.1 (⌚) WATCH
    231B ; emoji ; L1 ; none ; j # V1.1 (⌛) HOURGLASS
    ...

I believe you would want to remove each character listed in this document which had a…
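A sanitizer built from such ranges might look like the sketch below. The ranges used here are illustrative only, not the complete set from emoji-data.txt; a real sanitizer should be generated from the current data file. (In Python 3, `re` operates on code points, so supplementary-plane emoji need no surrogate handling.)

```python
import re

# Illustrative ranges only -- NOT the full emoji-data.txt set.
EMOJI_SKETCH = re.compile(
    "["
    "\u231A-\u231B"          # WATCH, HOURGLASS (from the excerpt above)
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "]"
)

def strip_emoji(s):
    """Remove code points falling in the sketched emoji ranges."""
    return EMOJI_SKETCH.sub("", s)

print(strip_emoji("meet at 3 \u231A ok \U0001F600"))  # "meet at 3  ok "
```

If the string arrives as UTF-16 code units (as in Java or C#), pairs like 0xD83D 0xDE00 must first be combined back into code points before range tests like these are meaningful.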

UnicodeDecodeError when performing os.walk

梦想与她 submitted on 2019-11-27 20:35:41
I am getting the error 'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128) when trying to do os.walk. The error occurs because some of the files in a directory have the 0x8b (non-UTF-8) byte in their names. The files come from a Windows system (hence the UTF-16 filenames), but I have copied the files over to a Linux system and am using Python 2.7 (running on Linux) to traverse the directories. I have tried passing a unicode start path to os.walk, and all the files and dirs it generates are unicode names until it comes to a non-UTF-8 name, and then for some reason it…
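The usual workaround is to walk with a byte-string top path so filenames are never decoded at all; a sketch (the file name `report\x8b.txt` is a made-up example of an undecodable name):

```python
import os
import tempfile

# On POSIX, filenames are byte strings; a name containing 0x8b is not
# valid UTF-8 and cannot be decoded as text. Passing a *bytes* top
# path keeps every yielded name as bytes, so os.walk never attempts
# the failing decode. (Same idea in Python 2: pass a str start path,
# not a unicode one.)
top = tempfile.mkdtemp().encode()
bad = os.path.join(top, b"report\x8b.txt")   # hypothetical bad name
open(bad, "wb").close()

for dirpath, dirnames, filenames in os.walk(top):
    for name in filenames:
        print(name)                            # b'report\x8b.txt' -- still bytes
        # Decode explicitly, replacing undecodable bytes, when text is needed:
        print(name.decode("utf-8", "replace"))
```

Decoding with `errors="replace"` (or `"surrogateescape"` in Python 3) lets you display or round-trip the bad names without crashing the traversal.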

Using JNA to get/set application identifier

寵の児 submitted on 2019-11-27 17:32:34
Following up on my previous question concerning the Windows 7 taskbar, I would like to diagnose why Windows isn't acknowledging that my application is independent of javaw.exe. I presently have the following JNA code to obtain the AppUserModelID:

    public class AppIdTest {
        public static void main(String[] args) {
            NativeLibrary lib;
            try {
                lib = NativeLibrary.getInstance("shell32");
            } catch (Error e) {
                System.err.println("Could not load Shell32 library.");
                return;
            }
            Object[] functionArgs = new Object[1];
            String functionName = null;
            Function function;
            try {
                functionArgs[0] = new String("Vendor…
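The AppUserModelID entry points in shell32 take UTF-16 (wide) strings, and marshalling the string argument correctly is the usual stumbling block in bindings like JNA. A hedged Python/ctypes sketch of the same idea (`APP_ID` is a hypothetical identifier; the Windows branch requires Windows 7 or later, so on other platforms only the wide-string marshalling is shown):

```python
import ctypes
import sys

APP_ID = "Vendor.Suite.App.1"    # hypothetical AppUserModelID

if sys.platform == "win32":
    shell32 = ctypes.windll.shell32
    # HRESULT SetCurrentProcessExplicitAppUserModelID(PCWSTR AppID);
    hr = shell32.SetCurrentProcessExplicitAppUserModelID(APP_ID)
    print("HRESULT:", hr)        # 0 (S_OK) on success
else:
    # ctypes marshals a Python str into a NUL-terminated UTF-16
    # buffer, which is what wide-char Windows entry points expect.
    buf = ctypes.create_unicode_buffer(APP_ID)
    print(len(buf))              # string length + trailing NUL
```

In JNA the equivalent concern is invoking the function with `W` (wide) semantics so the `String` is passed as UTF-16 rather than ANSI bytes.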

Using Unicode characters bigger than 2 bytes with .NET

烂漫一生 submitted on 2019-11-27 15:07:26
Question: I'm using this code to generate U+10FFFC:

    var s = Encoding.UTF8.GetString(new byte[] {0xF4, 0x8F, 0xBF, 0xBC});

I know it's for private use and such, but it does display as a single character, as I'd expect. The problems come when manipulating this Unicode character. If I later do this:

    foreach (var ch in s)
    {
        Console.WriteLine(ch);
    }

instead of printing just the single character, it prints two characters (i.e. the string is apparently composed of two characters). If I alter my…
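The two "characters" are the two UTF-16 code units of a surrogate pair: .NET strings are sequences of 16-bit `char`s, and U+10FFFC lies above U+FFFF. A Python sketch of the same distinction (Python strings index whole code points, so the UTF-16 view has to be asked for explicitly):

```python
# U+10FFFC is above U+FFFF, so in UTF-16 (the encoding .NET strings
# use internally) it occupies two 16-bit code units -- a surrogate
# pair. That is why the C# foreach over char yields two items.
ch = "\U0010FFFC"
print(len(ch))               # 1: Python counts code points
units = ch.encode("utf-16-le")
print(len(units) // 2)       # 2: UTF-16 needs two code units here
```

.NET exposes the same distinction: `str.Length` counts UTF-16 code units, while `System.Globalization.StringInfo` and `char.ConvertToUtf32` operate on whole code points.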

Any good solutions for C++ string code point and code unit?

主宰稳场 submitted on 2019-11-27 14:54:43
In Java, a String has the methods length()/charAt() and codePointCount()/codePointAt(). C++11 has

    std::string a = u8"很烫烫的一锅汤";

but a.size() is the length of the char array and cannot index a Unicode character. Are there any solutions for Unicode in C++ strings? I generally convert the UTF-8 string to a wide UTF-32/UCS-2 string before doing character operations. C++ does actually give us functions to do that, but they are not very user friendly, so I have written some nicer conversion functions here:

    // This should convert to whatever the system wide character encoding
    // is for the platform (UTF-32/Linux - UCS-2…
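The byte-count/code-point-count gap the question describes can be made concrete; a Python sketch mirroring the convert-then-index approach of the answer (decoding UTF-8 bytes to a code-point sequence plays the role of the C++ wide-string conversion):

```python
# In a UTF-8 byte string, size() counts bytes, not characters:
# each of these seven CJK characters takes three UTF-8 bytes.
utf8 = "很烫烫的一锅汤".encode("utf-8")
print(len(utf8))             # 21 bytes -- the std::string::size() view

# Decoding yields a code-point sequence -- the codePointCount() view.
text = utf8.decode("utf-8")
print(len(text))             # 7 code points
assert text[1] == "烫"       # indexing by code point now works
```

In C++, the analogous conversion is UTF-8 to char32_t (UTF-32), after which indexing and counting operate on whole code points.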

What's the best way to export UTF8 data into Excel?

故事扮演 submitted on 2019-11-27 13:08:51
So we have this web app where we support UTF-8 data. Hooray, UTF-8. And we can export the user-supplied data into CSV no problem; it's still UTF-8 at that point. The problem is that when you open a typical UTF-8 CSV in Excel, it reads it as ANSI-encoded text and accordingly tries to read two-byte characters like ø and ü as two separate characters, and you end up with fail. So I've done a bit of digging (the Intervals folks have an interesting post about it here), and there are some limited, if ridiculously annoying, options out there. Among them: supplying a UTF-16 little-endian TSV file, which Excel…
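One commonly used workaround (alongside the UTF-16 TSV route) is to prefix the UTF-8 CSV with a byte order mark, which recent Excel versions use to detect the encoding instead of assuming ANSI; a sketch, writing to an in-memory buffer:

```python
import csv
import io

# Python's "utf-8-sig" codec emits the UTF-8 BOM (EF BB BF) before
# the first write; Excel sees the BOM and reads the file as UTF-8.
buf = io.BytesIO()
wrapper = io.TextIOWrapper(buf, encoding="utf-8-sig", newline="")
writer = csv.writer(wrapper)
writer.writerow(["name", "city"])
writer.writerow(["Søren", "Zürich"])    # two-byte UTF-8 characters
wrapper.flush()

data = buf.getvalue()
print(data[:3])     # b'\xef\xbb\xbf' -- the UTF-8 BOM
```

How well this behaves varies by Excel version and locale, which is why the post the question links weighs it against the UTF-16 little-endian TSV approach.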

Java Unicode String length

微笑、不失礼 submitted on 2019-11-27 10:25:33
Question: I am trying hard to get the length of a Unicode string and have tried various options. It looks like a small problem but has stumped me in a big way. Here I am trying to get the length of the string str1. I get 6, but it is actually 3: moving the cursor over the string "குமார்" also shows it as 3 characters. Basically I want to measure the length and print each character, like "கு", "மா", "ர்".

    public class one {
        public static void main(String[] args) {
            String str1 = new String("குமார்");
            System.out…
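The gap here is between code points (6 of them: three Tamil consonants, each followed by a dependent sign) and user-perceived characters, i.e. grapheme clusters (3). A rough Python sketch of why the counts differ; the cluster split below is a simplification of the UAX #29 rules (attach every combining mark to the preceding character), not a full implementation:

```python
import unicodedata

s = "குமார்"
print(len(s))    # 6 code points: 3 consonants + 3 dependent signs

# Rough grapheme split: start a new cluster at each non-mark code
# point, attach marks (general category M*) to the previous one.
clusters = []
for ch in s:
    if clusters and unicodedata.category(ch).startswith("M"):
        clusters[-1] += ch
    else:
        clusters.append(ch)
print(len(clusters))   # 3 user-perceived characters
```

In Java the corresponding tool is `java.text.BreakIterator.getCharacterInstance()`, which iterates over grapheme cluster boundaries rather than chars or code points.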

VBA Output to file using UTF-16

我是研究僧i submitted on 2019-11-27 09:46:47
I have a very complex problem that is difficult to explain properly. There is LOTS of discussion about this across the internet, but nothing definitive. Any help, or a better explanation than mine, is greatly appreciated. Essentially, I'm just trying to write an XML file using UTF-16 with VBA. If I do this:

    sXML = "<?xml version='1.0' encoding='utf-8'?>"
    sXML = sXML & rest_of_xml_document
    Print #iFile, sXML

then I get a file that is valid XML. However, if I change the "encoding=" to "utf-16", I get this error from my XML validator: Switch from current encoding to specified encoding not supported…
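The validator error suggests a declaration/content mismatch: the header claims UTF-16 while the bytes written by a plain `Print #` are in the system's ANSI code page. The fix is to write the file genuinely as UTF-16, BOM first; sketched here in Python to show the byte layout the validator expects:

```python
import codecs
import os
import tempfile

xml = "<?xml version='1.0' encoding='utf-16'?><root>héllo</root>"
path = os.path.join(tempfile.mkdtemp(), "out.xml")

# Write the byte order mark, then every character as UTF-16LE,
# so the bytes actually match the declared encoding.
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE)           # FF FE
    f.write(xml.encode("utf-16-le"))

raw = open(path, "rb").read()
print(raw[:2])    # b'\xff\xfe' -- the UTF-16LE BOM
```

In VBA the equivalent is to bypass `Print #` and emit the bytes through something encoding-aware, e.g. an ADODB.Stream opened with Charset = "utf-16".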

How to write a 3-byte Unicode literal in Java?

左心房为你撑大大i submitted on 2019-11-27 08:06:59
Question: I'd like to write the Unicode literal U+10428 in Java. http://www.marathon-studios.com/unicode/U10428/Deseret_Small_Letter_Long_I I tried '\u10428' and it doesn't compile.

Answer 1: Because Java went all-in on Unicode back when people thought 64K characters would be enough for everyone (where have we heard that before?), it started out with UCS-2 and later upgraded to UTF-16, but it never added an escape sequence for Unicode characters outside the BMP. Thus, your only recourse is manually recoding to a…
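The manual recoding the answer alludes to is computing the UTF-16 surrogate pair for U+10428, since Java's `\u` escape only covers the BMP. A sketch of the arithmetic (shown in Python so the result can be checked against the character's UTF-16 encoding):

```python
# Characters above U+FFFF are split into two UTF-16 code units:
# subtract 0x10000, then spread the remaining 20 bits over a
# high (0xD800 + top 10 bits) and low (0xDC00 + bottom 10 bits)
# surrogate.
cp = 0x10428
v = cp - 0x10000
pair = (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF))
print([hex(u) for u in pair])    # ['0xd801', '0xdc28']

# Round-trip check against the UTF-16 encoding of the character:
assert chr(cp).encode("utf-16-be") == bytes([0xD8, 0x01, 0xDC, 0x28])
```

In Java itself, the working forms are the string literal "\uD801\uDC28" (a pair of char escapes, not a char literal) or new String(Character.toChars(0x10428)).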