UTF-16

How can I convert UTF-16 to UTF-32 in Java?

六眼飞鱼酱① submitted on 2019-12-04 11:48:27
I have looked for solutions, but there doesn't seem to be much on this topic. I have found solutions that suggest:

```java
String unicodeString = new String("utf8 here");
byte[] bytes = unicodeString.getBytes("UTF8");
String converted = new String(bytes, "UTF16");
```

for converting from UTF-8 to UTF-16; however, Java doesn't handle "UTF32", which makes this solution unviable. Does anyone know any other way to achieve this?

Java does handle UTF-32. Try this test:

```java
byte[] a = "1".getBytes("UTF-32");
System.out.println(a.length);
```

It will show that the array's length is 4. After searching I got this to work: public
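Whatever the language, the conversion itself is the same arithmetic: each UTF-16 surrogate pair collapses into a single 32-bit code point. A minimal C++ sketch of that step, assuming well-formed input (real code must validate the pairs):

```cpp
#include <cstddef>
#include <string>

// Decode well-formed UTF-16 into UTF-32, one code unit or surrogate pair at a time.
std::u32string utf16_to_utf32(const std::u16string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size()) {
            // High surrogate followed by low surrogate: combine into one code point.
            out.push_back(0x10000 + ((char32_t(u) - 0xD800) << 10)
                                  + (char32_t(s[++i]) - 0xDC00));
        } else {
            out.push_back(u); // BMP character: the code unit is the code point.
        }
    }
    return out;
}
```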

Does std::wstring support UTF-16 and UTF-32 on Windows?

天涯浪子 submitted on 2019-12-04 11:05:08
I'm learning about Unicode and have a few questions that I'm hoping to get answered.

1) I've read that on Linux a std::wstring is 4 bytes, while on Windows it's 2 bytes. Does this mean that the internal support on Linux is UTF-32, while on Windows it is UTF-16?

2) Is the std::wstring interface very similar to the std::string interface?

3) Does VC++ offer support for using a 4-byte std::wstring?

4) Do you have to change compiler options if you use std::wstring?

As a side note, I came across a string library for working with UTF-8 which has a very similar interface to std::string, which provides familiar
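To the first and third points: wchar_t's width is a property of the platform ABI that can be checked directly, and since C++11 there are fixed-width string types that behave the same everywhere. A small sketch (the 2- and 4-byte figures are the conventional values for MSVC and glibc-based toolchains, not a formal guarantee):

```cpp
#include <string>

// MSVC uses a 2-byte wchar_t (UTF-16 code units); Linux/glibc uses 4 bytes
// (UTF-32 code units), so std::wstring's encoding differs per platform.
static_assert(sizeof(wchar_t) == 2 || sizeof(wchar_t) == 4,
              "unusual wchar_t width");

// The C++11 fixed-width strings sidestep the difference entirely:
std::u16string s16 = u"always UTF-16 code units";
std::u32string s32 = U"always UTF-32 code units";
```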

How to convert a UTF-16 ushort array to a UTF-8 std::string?

爷,独闯天下 submitted on 2019-12-04 07:58:56
Currently I'm writing a plugin which is just a wrapper around an existing library. The plugin's host passes me a UTF-16 formatted string defined as follows:

```cpp
typedef unsigned short PA_Unichar;
```

The wrapped library accepts only a const char* or a UTF-8 formatted std::string. I tried writing a conversion function like:

```cpp
std::string toUtf8(const PA_Unichar* data) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return std::string(convert.to_bytes(static_cast<const char16_t*>(data)));
}
```

But obviously this doesn't work, throwing a compile error "static_cast
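static_cast cannot convert between unrelated pointer types such as const unsigned short* and const char16_t*, which is what the compile error is about. A commonly suggested repair, assuming PA_Unichar really holds UTF-16 code units of the same size as char16_t, is reinterpret_cast; a sketch (note that std::wstring_convert is deprecated since C++17):

```cpp
#include <codecvt>
#include <locale>
#include <string>

typedef unsigned short PA_Unichar; // as defined by the plugin host

std::string toUtf8(const PA_Unichar* data) {
    static_assert(sizeof(PA_Unichar) == sizeof(char16_t),
                  "PA_Unichar must match char16_t");
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    // reinterpret_cast, not static_cast: the pointer types are unrelated.
    return convert.to_bytes(reinterpret_cast<const char16_t*>(data));
}
```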

Size of wchar_t* for surrogate pair (Unicode character out of BMP) on Windows

送分小仙女□ submitted on 2019-12-04 07:55:45
I have encountered an interesting issue on Windows 8. I tested whether I can represent Unicode characters outside the BMP with wchar_t* strings. The following test code produced unexpected results for me:

```cpp
const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // the "Han" character
int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.
int i2 = sizeof(s1);      // i2 == 4, because of the terminating '\0' (I guess).
int i3 = sizeof(s2);      // i3 == 4, why?
```

U+2008A is a Han character outside the Basic Multilingual Plane, so it should be represented by a surrogate
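The surprise comes from applying sizeof to pointers: sizeof(s1) and sizeof(s2) yield the size of a wchar_t* (4 bytes on a 32-bit build) regardless of what the pointer points to. To count the string's code units, use std::wcslen instead; a minimal sketch, assuming a Windows toolchain with a 16-bit wchar_t:

```cpp
#include <cstdio>
#include <cwchar>

int main() {
    const wchar_t* s2 = L"\U0002008A";            // one code point, two UTF-16 units
    std::printf("units: %zu\n", std::wcslen(s2)); // 2: a high and a low surrogate
    std::printf("%04X %04X\n",                    // D840 DC8A
                (unsigned)s2[0], (unsigned)s2[1]);
}
```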

How to Convert UTF-16 to UTF-32 and Print the Resulting wchar_t in C?

耗尽温柔 submitted on 2019-12-04 03:44:59
Question: I'm trying to print out a string of UTF-16 characters. I posted this question a while back, and the advice given was to convert to UTF-32 using iconv and print it as a string of wchar_t. I've done some research and managed to code the following:

```c
// *c is the pointer to the characters (UTF-16) I'm trying to print
// sz is the size in bytes of the input I'm trying to print
iconv_t icv;
char in_buf[sz];
char* in;
size_t in_sz;
char out_buf[sz * 2];
char* out;
size_t out_sz;
icv = iconv_open("UTF
```
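For reference, a minimal sketch of the complete iconv round trip (written as C++ for consistency with the other examples; the "UTF-16LE"/"UTF-32LE" names assume little-endian data, and real code must check every return value):

```cpp
#include <iconv.h>
#include <cstddef>

// Convert a UTF-16LE byte buffer to UTF-32LE; returns bytes written or (size_t)-1.
std::size_t utf16_to_utf32(char* in, std::size_t in_sz,
                           char* out, std::size_t out_sz) {
    iconv_t icv = iconv_open("UTF-32LE", "UTF-16LE");
    if (icv == (iconv_t)-1) return static_cast<std::size_t>(-1);
    std::size_t out_left = out_sz;
    std::size_t r = iconv(icv, &in, &in_sz, &out, &out_left);
    iconv_close(icv);
    return r == static_cast<std::size_t>(-1) ? r : out_sz - out_left;
}
```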

How can I match emoji with an R regex?

試著忘記壹切 submitted on 2019-12-04 03:41:16
I want to determine which elements of my vector contain emoji:

```r
x = c('😂', 'no', '🍹', '😀', 'no', '😛', '䨺', '감사')
x
# [1] "\U0001f602" "no" "\U0001f379" "\U0001f600" "no" "\U0001f61b" "䨺" "감사"
```

Related posts only cover other languages, and because they mostly refer to specialized libraries, I couldn't figure out a way to translate them to R:

- What is the regex to extract all the emojis from a string?
- How do I remove emoji from string
- replace emoji unicode symbol using regexp in javascript
- Regular expression matching emoji in Mac OS X / iOS
- remove unicode emoji using re in python

The second looked very
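Whatever the language, the core of such a match is a test against emoji code-point ranges. A C++ sketch of that test (the ranges below are a small, illustrative subset of the Unicode emoji blocks, not a complete list):

```cpp
#include <initializer_list>

// Illustrative subset of emoji code-point ranges; real matching should be
// driven by the full Unicode emoji data, which is far larger than this.
bool is_emoji(char32_t cp) {
    struct Range { char32_t lo, hi; };
    for (Range r : { Range{0x1F300, 0x1F5FF},    // symbols & pictographs
                     Range{0x1F600, 0x1F64F},    // emoticons (covers the examples)
                     Range{0x1F680, 0x1F6FF},    // transport & map symbols
                     Range{0x1F900, 0x1F9FF} })  // supplemental symbols
        if (cp >= r.lo && cp <= r.hi) return true;
    return false;
}
```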

How do I convert a string in UTF-16 to UTF-8 in C++

喜欢而已 submitted on 2019-12-03 22:59:46
Question: Consider:

```cpp
STDMETHODIMP CFileSystemAPI::setRRConfig( BSTR config_str, VARIANT* ret )
{
    mReportReaderFactory.reset( new sbis::report_reader::ReportReaderFactory() );
    USES_CONVERSION;
    std::string configuration_str = W2A( config_str );
```

But in config_str I get a string in UTF-16. How can I convert it to UTF-8 in this piece of code?

Answer 1: If you are using C++11 you may check this out: http://www.cplusplus.com/reference/codecvt/codecvt_utf8_utf16/

Answer 2: void encode_unicode_character(char* buffer, int*
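A minimal sketch of what the C++11 approach from Answer 1 might look like, assuming a Windows 16-bit wchar_t so the BSTR's contents can be treated as a wide string (std::codecvt_utf8_utf16 is deprecated since C++17 but still widely available):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Narrow a UTF-16 wide string (e.g. a BSTR's characters) to UTF-8.
std::string utf16_to_utf8(const std::wstring& w) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(w);
}
```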

How to force UTF-16 while reading/writing in Java?

五迷三道 submitted on 2019-12-03 20:22:49
Question: I see that you can specify UTF-16 as the charset via Charset.forName("UTF-16"), and that you can create a new UTF-16 decoder via Charset.forName("UTF-16").newDecoder(), but I only see the ability to specify a CharsetDecoder on InputStreamReader's constructor. So how do you specify the use of UTF-16 while reading any stream in Java?

Answer 1: Input streams deal with raw bytes. When you read directly from an input stream, all you get is raw bytes, where character sets are irrelevant. The
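The layering the answer describes, a raw byte stream with a decoding reader wrapped around it, is not unique to Java. For comparison, a C++ sketch of the same idea, imbuing a UTF-16 decoder onto a byte stream ("data.txt" is a placeholder; std::codecvt_utf16 is deprecated since C++17 and shown only to illustrate the layering):

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main() {
    // Raw bytes first, then a UTF-16 decoder on top, mirroring Java's
    // InputStream -> InputStreamReader layering.
    std::wifstream in("data.txt", std::ios::binary);
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>));
    std::wstring line;
    std::getline(in, line);
}
```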

How can I convert wstring to u16string?

大城市里の小女人 submitted on 2019-12-03 16:19:42
I want to convert a wstring to a u16string in C++. I can convert a wstring to a string, or the reverse, but I don't know how to convert to u16string.

```cpp
u16string CTextConverter::convertWstring2U16(wstring str)
{
    int iSize;
    u16string szDest[256] = {};
    memset(szDest, 0, 256);
    iSize = WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, NULL, 0, 0, 0);
    WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, szDest, iSize, 0, 0);
    u16string s16 = szDest;
    return s16;
}
```

The error is in the szDest argument of WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, szDest, iSize, 0, 0), because a u16string can't be used as an LPSTR. How can I fix this code
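Two things are tangled here: WideCharToMultiByte with CP_UTF8 converts to UTF-8 bytes, not to UTF-16, and on Windows wchar_t is already a 16-bit UTF-16 code unit, so no re-encoding is needed at all. A minimal sketch under that Windows assumption (on platforms with a 32-bit wchar_t this element-wise copy would mangle non-BMP characters):

```cpp
#include <string>

// Windows-only: wchar_t and char16_t are both UTF-16 code units here,
// so the conversion is an element-wise copy.
std::u16string convertWstring2U16(const std::wstring& str) {
    return std::u16string(str.begin(), str.end());
}
```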

How do I create a string with a surrogate pair inside of it?

♀尐吖头ヾ submitted on 2019-12-03 16:10:57
Question: I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that I have no idea how to create a string that contains a surrogate pair that will actually cause the string reversal to fail. How does one actually go about creating a string with a surrogate pair in it so that I can see the failure myself?

Answer 1: The term "surrogate pair" refers to a means of encoding Unicode characters
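The language-neutral recipe is to pick any code point above U+FFFF; in UTF-16 it is necessarily stored as two code units, which is exactly what naive unit-by-unit reversal breaks. A C++ sketch of two equivalent ways to spell such a string (U+1D161 is just an arbitrary non-BMP example):

```cpp
#include <cassert>
#include <string>

int main() {
    // One code point above U+FFFF...
    std::u16string a = u"\U0001D161";  // MUSICAL SYMBOL SIXTEENTH NOTE
    // ...is stored as two UTF-16 code units: high then low surrogate.
    std::u16string b = u"\xD834\xDD61";
    assert(a == b && a.size() == 2);
}
```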