utf

What characters do not directly map from Cp1252 to UTF-8?

跟風遠走 submitted on 2019-12-03 00:05:04
I've read in several Stack Overflow answers that some characters do not map directly (or are even "unmappable") when converting from Cp1252 (aka Windows-1252; they're the same, aren't they?) to UTF-8, e.g. here: https://stackoverflow.com/a/23399926/2018047 Can someone please shed some more light on this? Does that mean that if I batch-convert source code from Cp1252 to UTF-8, some characters will end up as garbage? This is what the Windows-1252 code page looks like. As you can see, bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D do not have anything assigned to them. If your input file contains
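As a rough illustration of the point about unassigned bytes, here is a minimal Python sketch. It assumes Python's built-in `cp1252` codec, which treats those five byte values as undefined and refuses to decode them; the helper names `cp1252_to_utf8` and `find_unmappable` are made up for this example.

```python
# The five byte values the excerpt lists as unassigned in Windows-1252.
UNASSIGNED = [0x81, 0x8D, 0x8F, 0x90, 0x9D]


def cp1252_to_utf8(data):
    """Strictly re-encode Cp1252 bytes as UTF-8.

    Raises UnicodeDecodeError if the input contains an unassigned byte.
    """
    return data.decode("cp1252").encode("utf-8")


def find_unmappable(data):
    """Return the offsets of bytes that the cp1252 codec cannot decode."""
    offsets = []
    for i, b in enumerate(data):
        try:
            bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            offsets.append(i)
    return offsets
```

Running `find_unmappable` over a file before a batch conversion is one way to see whether any of the problematic bytes occur at all; every other byte value converts without loss.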

utf8 representation as normal text

我是研究僧i submitted on 2019-12-02 01:20:42
$text = "\xd0\xa2\xd0\xb0\xd0\xb9\xd0\xbd\xd0\xb0";
$text = iconv('UTF-8', 'UTF-8//IGNORE', $text);
var_dump($text); // Тайна - good
$text = file_get_contents('log.txt');
$text = iconv('UTF-8', 'UTF-8//IGNORE', trim($text));
var_dump($text); // \xd0\xa2\xd0\xb0\xd0\xb9\xd0\xbd\xd0\xb0 - bad
Why did iconv not work when the string \xd0\xa2\xd0\xb0\xd0\xb9\xd0\xbd\xd0\xb0 was read from a file, and how can I fix it?
The string literal and the text in the file are not equivalent. $text is already UTF-8 (Тайна) and iconv does nothing to it. This is because you use escape sequences to put the actual binary value
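The difference between an escape sequence in source code and the same characters stored in a file can be sketched in Python as follows. This is only an illustration of the idea, not the answer's PHP fix; the helper name `unescape_hex` and the assumption that the file holds plain `\xHH` text are mine.

```python
import re

# Matches a literal backslash, an "x", and two hex digits, as stored in the file.
_HEX_ESCAPE = re.compile(rb"\\x([0-9A-Fa-f]{2})")


def unescape_hex(raw):
    """Turn literal '\\xHH' text into the bytes it spells out."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


# In source code the escapes are resolved by the parser: 10 real bytes.
literal = b"\xd0\xa2\xd0\xb0\xd0\xb9\xd0\xbd\xd0\xb0"

# In the file the same characters are plain ASCII text: 40 bytes.
from_file = rb"\xd0\xa2\xd0\xb0\xd0\xb9\xd0\xbd\xd0\xb0"

decoded = unescape_hex(from_file).decode("utf-8")
```

After unescaping, the file content and the literal are the same byte string, and both decode to the same Cyrillic word.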

jsp utf encoding

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-01 16:05:54
I'm having a hard time figuring out how to handle this problem: I'm developing a web tool for an Italian university, and I have to display words with accents (such as è, ù, ...); sometimes I get these words from a PostgreSQL table (UTF8-encoded), but mostly I have to read long passages from a file. These files are encoded as UTF-8 XML, and display fine in Smultron or any UTF-8 editor (they were created by parsing, in Python, old files with entities such as &egrave; instead of "è"). I wrote a Java class which extracts the relevant segments from the XML file, which works like this: String s = parseText

Replace éàçè… with equivalent “eace” In GWT

♀尐吖头ヾ submitted on 2019-12-01 10:53:24
I tried s=Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""); but it seems that the GWT API doesn't provide such a function. I also tried s=s.replace("é",e); but it doesn't work either. The scenario is that I am trying to generate a token from the clicked Widget's text for history management. okrasz: You can take the ASCII folding filter from Lucene and add it to your project. You can just take the foldToASCII() method from ASCIIFoldingFilter (the method does not have any dependencies). There is also a patch in Jira that has a full class for that without any dependencies - see here. It
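The decompose-then-strip approach from the question can be sketched in Python as below. This is not a GWT solution; it only illustrates the technique, and the function name `strip_accents` is made up. Unlike the Java one-liner, which drops every non-ASCII character after decomposition, this variant drops only the combining marks.

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD, then drop the combining marks left behind."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For the title's example, `strip_accents("éàçè")` yields `"eace"`. Letters that have no decomposition (such as "ø" or "ß") pass through unchanged, which is one reason a lookup-table approach like Lucene's foldToASCII() covers more cases.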

Iconv is converting to UTF-16 instead of UTF-8 when invoked from powershell

故事扮演 submitted on 2019-12-01 04:48:11
Question: I have a problem while trying to batch convert the encoding of some files from ISO-8859-1 to UTF-8 using iconv in a PowerShell script. I have this bat file, which works fine:
for %%f in (*.txt) do (
echo %%f
C:\"Program Files"\GnuWin32\bin\iconv.exe -f iso-8859-1 -t utf-8 %%f > %%f.UTF_8_MSDOS
)
I need to convert all the files in the directory structure, so I wrote this other script, this time using PowerShell:
Get-ChildItem -Recurse -Include *.java | ForEach-Object { $inFileName = $_
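One way to sidestep shell redirection altogether is to do the recursive conversion in a single script that writes the output bytes itself. The following Python sketch assumes the inputs really are ISO-8859-1; the function name `convert_tree`, the `.utf8` suffix, and the default pattern are hypothetical choices for this example, not part of the original scripts.

```python
from pathlib import Path


def convert_tree(root, pattern="*.java", src="iso-8859-1", dst="utf-8", suffix=".utf8"):
    """Write a re-encoded copy next to every matching file under root.

    Returns the list of paths that were written.
    """
    written = []
    # sorted() materialises the file list before any new files are created.
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_bytes().decode(src)
        out = path.with_name(path.name + suffix)
        # Writing bytes directly means no shell or console encoding is involved.
        out.write_bytes(text.encode(dst))
        written.append(out)
    return written
```

Because the script encodes and writes the bytes itself, the output encoding does not depend on how the invoking shell handles redirected output.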

How can I put a

风流意气都作罢 submitted on 2019-12-01 03:34:28
How can I do this? I'm pretty new to Java and Android and I have the problem described above. When I paste the emoji into the XML file, it shows a white square and another weird character that "copies" the next character. Any idea how to work this out?
Try using this library: emoji-java. I know you want an XML way and this is Java, but it may help you. Example:
String str = "An 😀awesome 😃string with a few 😉emojis!";
String result = EmojiParser.parseToAliases(str);
System.out.println(result);
// Prints:
// "An :grinning:awesome :smiley:string with a few :wink:emojis!"
You can put emojis in an XML, and the
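A general XML technique that may be relevant here is writing characters outside the Basic Multilingual Plane as numeric character references, so the file itself stays plain ASCII. The Python sketch below only demonstrates that technique with the standard library parser; the helper name `to_char_refs` is made up, and whether this resolves the Android display problem described above is an assumption.

```python
import xml.etree.ElementTree as ET


def to_char_refs(text):
    """Replace every character above U+FFFF with an XML numeric character reference."""
    return "".join(
        "&#x{:X};".format(ord(ch)) if ord(ch) > 0xFFFF else ch
        for ch in text
    )


emoji = "\U0001F600"  # GRINNING FACE
escaped = to_char_refs("An " + emoji + "awesome string")

# An XML parser turns the reference back into the original character.
parsed = ET.fromstring("<string>" + escaped + "</string>").text
```

Such characters take four bytes in UTF-8 and a surrogate pair in UTF-16, so an editor or toolchain that mishandles either form can plausibly produce a stray "square plus odd character" rendering.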

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)

女生的网名这么多〃 submitted on 2019-12-01 00:56:19
I am trying to write data to a StringIO object using Python and then ultimately load this data into a Postgres database using psycopg2's copy_from() function. When I first did this, copy_from() threw an error: ERROR: invalid byte sequence for encoding "UTF8": 0xc92 So I followed this question. I figured out that my Postgres database has UTF8 encoding. The file/StringIO object I am writing my data into shows its encoding as the following: setgid Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators I tried to encode every string that I am writing to
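Byte 0x92 is not ASCII, and on its own it is not valid UTF-8 either; in Windows-1252 it is the right single quotation mark. The sketch below shows, in Python 3, decoding such bytes explicitly before they reach the buffer. That the source data is Cp1252 is an assumption on my part, and the helper name `to_text_buffer` is hypothetical.

```python
import io


def to_text_buffer(rows, source_encoding="cp1252"):
    """Decode each raw byte row explicitly and collect the text in a StringIO."""
    buf = io.StringIO()
    for raw in rows:
        # Explicit decoding avoids any implicit fallback to the ascii codec.
        buf.write(raw.decode(source_encoding))
        buf.write("\n")
    buf.seek(0)
    return buf
```

Once the text is held as Unicode, encoding it as UTF-8 for the database produces the three-byte sequence E2 80 99 for that quotation mark instead of a lone 0x92.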

difference between NLS_NCHAR_CHARACTERSET and NLS_CHARACTERSET for Oracle

那年仲夏 submitted on 2019-11-30 14:02:38
I have a quick question: I would like to know the difference between the NLS_NCHAR_CHARACTERSET and NLS_CHARACTERSET settings in Oracle. As I understand it, NLS_NCHAR_CHARACTERSET applies to NVARCHAR data types and NLS_CHARACTERSET to VARCHAR2 data types. I tried to test this on my development server, whose current character set settings are as follows:
PARAMETER                      VALUE
------------------------------ ----------------------------------------
NLS_NCHAR_CHARACTERSET         AL16UTF16
NLS_NUMERIC_CHARACTERS         .,
NLS_CHARACTERSET               US7ASCII
Then I inserted some Chinese character

UTF conversion functions in C++11

做~自己de王妃 submitted on 2019-11-30 13:30:26
Question: I'm looking for a collection of functions for performing UTF character conversion in C++11. It should include conversion to and from any of utf8, utf16, and utf32. A function for recognizing byte order marks would be helpful, too. Answer 1: Update: The functions listed here are maintained in a GitHub repo, .hpp, .cpp and tests. Some UTF-16 functions have been disabled because they do not work correctly. The "banana" tests in the utf.test.cpp file demonstrate the problem. Also included a "read_with
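The two pieces the question asks for, byte order mark recognition and conversion between the UTF forms, can be sketched in Python as follows. This is not the C++11 code from the answer's repository, only an illustration of the logic; the names `detect_bom` and `convert` are made up.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM,
# so checking UTF-16 first would misidentify UTF-32 LE input.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def convert(data, src, dst):
    """Convert bytes between UTF encodings by going through Unicode text."""
    return data.decode(src).encode(dst)
```

A character outside the Basic Multilingual Plane, such as U+1F34C (banana), is a useful test input because it needs a surrogate pair in UTF-16, which is where hand-written converters tend to go wrong.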