Convert International String to \u Codes in Java

离开以前 2020-11-29 02:53

How can I convert an international (e.g. Russian) String to \u escape sequences (Unicode numbers),
e.g. \u041e\u041a for ОК?

12 answers
  • 2020-11-29 03:08

    There are three parts to the answer:

    1. Get the Unicode for each character
    2. Determine if it is in the Cyrillic Page
    3. Convert to Hexadecimal.

    To get each character you can iterate through the String using the charAt() or toCharArray() methods.

    for( char c : s.toCharArray() )
    

    The value of the char is the Unicode value.
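
    One caveat worth sketching: a Java char is a UTF-16 code unit, so the statement above holds for every character in the Basic Multilingual Plane (which includes Cyrillic), while characters outside it occupy two chars. The class and method names below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class CharValues {
    // Returns the numeric value of every char in s, formatted as U+ followed by four hex digits.
    static List<String> codeUnits(String s) {
        List<String> out = new ArrayList<>();
        for (char c : s.toCharArray()) {
            // Casting a char to int yields its UTF-16 code unit value.
            out.add(String.format("U+%04X", (int) c));
        }
        return out;
    }

    public static void main(String[] args) {
        // The literal below is the two Cyrillic letters from the question.
        System.out.println(codeUnits("\u041e\u041a"));
    }
}
```

    For the two Cyrillic letters from the question, this prints [U+041E, U+041A].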

    The Cyrillic Unicode characters are any character in the following ranges:

    Cyrillic:            U+0400–U+04FF ( 1024 -  1279)
    Cyrillic Supplement: U+0500–U+052F ( 1280 -  1327)
    Cyrillic Extended-A: U+2DE0–U+2DFF (11744 - 11775)
    Cyrillic Extended-B: U+A640–U+A69F (42560 - 42655)
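
    As an alternative to hard-coding the numeric ranges, the JDK's Character.UnicodeBlock can answer the same question. The sketch below covers only the four blocks listed above; the class name CyrillicCheck is an assumption for illustration.

```java
// Hypothetical helper, not part of the original answer.
class CyrillicCheck {
    // True if c belongs to one of the four Cyrillic blocks listed in the table.
    static boolean isCyrillic(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.CYRILLIC
            || block == Character.UnicodeBlock.CYRILLIC_SUPPLEMENTARY
            || block == Character.UnicodeBlock.CYRILLIC_EXTENDED_A
            || block == Character.UnicodeBlock.CYRILLIC_EXTENDED_B;
    }

    public static void main(String[] args) {
        System.out.println(isCyrillic((char) 0x041E)); // a Cyrillic letter
        System.out.println(isCyrillic('A'));           // a Latin letter
    }
}
```

    This trades the explicit numbers for named constants, at the cost of depending on which blocks the running JDK knows about.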
    

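    If it is in one of these ranges, it is Cyrillic; a simple if check is enough. For characters in range, format the value as four hexadecimal digits and prepend the "\\u". (Integer.toHexString() alone does not zero-pad, so U+041E would come out with only three digits.) Put together it should look something like this:

    static String escapeCyrillic( String s ){
        // Inclusive decimal bounds of the four Cyrillic blocks listed above.
        final int[][] ranges = new int[][]{
                {  1024,  1279 },
                {  1280,  1327 },
                { 11744, 11775 },
                { 42560, 42655 },
            };
        StringBuilder b = new StringBuilder();

        for( char c : s.toCharArray() ){
            int[] insideRange = null;
            for( int[] range : ranges ){
                if( range[0] <= c && c <= range[1] ){
                    insideRange = range;
                    break;
                }
            }

            if( insideRange != null ){
                // %04x zero-pads to the four hex digits the escape syntax requires.
                b.append( String.format( "\\u%04x", (int) c ) );
            }else{
                b.append( c );
            }
        }

        return b.toString();
    }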
    

    Edit: it is probably better to make the check c < 128 and swap the if and else bodies, so that everything that isn't ASCII gets escaped. I was probably too literal in my reading of your question.

  • 2020-11-29 03:08

    You could probably hack it from this JavaScript code:

    /* convert                                                                     
  • 2020-11-29 03:10

    There is a JDK tool, native2ascii, that is run from the command line as follows (it shipped with JDK 8 and earlier and was removed in JDK 9):

    native2ascii -encoding utf8 src.txt output.txt
    

    Example:

    src.txt

    بسم الله الرحمن الرحيم
    

    output.txt

    \u0628\u0633\u0645 \u0627\u0644\u0644\u0647 \u0627\u0644\u0631\u062d\u0645\u0646 \u0627\u0644\u0631\u062d\u064a\u0645
    

    If you want to use it in your Java application, you can wrap the command like this:

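    // Needs: import java.io.File; the enclosing method must declare IOException and InterruptedException.
    String pathSrc = "./tmp/src.txt";
    String pathOut = "./tmp/output.txt";
    // Pass the arguments separately so that paths containing spaces survive intact.
    Process process = new ProcessBuilder("native2ascii", "-encoding", "utf8",
            new File(pathSrc).getAbsolutePath(), new File(pathOut).getAbsolutePath())
            .inheritIO().start();
    // Wait for the tool to finish; otherwise output.txt may not be complete yet.
    int exitCode = process.waitFor();
    System.out.println("native2ascii exited with code " + exitCode);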
    

    Then read the content of the new file.
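
    If launching an external process is not desirable, or the tool is not available on the PATH, the same file-to-file conversion can be sketched in plain Java. This is a minimal sketch, not a reimplementation of native2ascii: it assumes UTF-8 input, escapes every char above 127, and the class name FileEscaper is invented for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the JDK or of the original answer.
class FileEscaper {
    // Replaces every non-ASCII char with a backslash-u escape of four lowercase hex digits.
    static String escape(String s) {
        StringBuilder b = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= 128) {
                b.append(String.format("\\u%04x", (int) c));
            } else {
                b.append(c);
            }
        }
        return b.toString();
    }

    // Reads src as UTF-8 and writes the escaped, pure-ASCII text to out.
    static void convert(Path src, Path out) throws IOException {
        String text = new String(Files.readAllBytes(src), StandardCharsets.UTF_8);
        Files.write(out, escape(text).getBytes(StandardCharsets.US_ASCII));
    }
}
```

    Calling FileEscaper.convert with the paths of src.txt and output.txt would then take the place of the external command, with no process to wait for.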

  • 2020-11-29 03:12

    There's a command-line tool that ships with Java called native2ascii. It converts Unicode files to ASCII-escaped files. I've found that this is a necessary step for generating .properties files for localization.
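
    When the goal is specifically a .properties file, it may be worth knowing that java.util.Properties.store(OutputStream, String) already writes characters outside printable ASCII as escapes. The sketch below captures that output in memory; the class name and the key "greeting" are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

// Hypothetical demo class, not part of the original answer.
class PropertiesEscapeDemo {
    // Stores a single key/value pair and returns the text that store() produced.
    static String store(String key, String value) throws IOException {
        Properties props = new Properties();
        props.setProperty(key, value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // store(OutputStream, comments) writes ISO-8859-1 and escapes other characters.
        props.store(out, null);
        return out.toString("ISO-8859-1");
    }

    public static void main(String[] args) throws IOException {
        // The value is the two Cyrillic letters from the question.
        System.out.print(store("greeting", "\u041e\u041a"));
    }
}
```

    The output starts with a timestamp comment line, followed by the key and the escaped value. Properties.load() reverses the escaping when the file is read back.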

  • 2020-11-29 03:16

    Here's an improved version of ArtB's answer:

        StringBuilder b = new StringBuilder();
    
        for (char c : input.toCharArray()) {
            if (c >= 128)
                b.append("\\u").append(String.format("%04X", (int) c));
            else
                b.append(c);
        }
    
        return b.toString();
    

    This version escapes all non-ASCII chars and works correctly for low Unicode code points like Ä.
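
    The point about low code points can be illustrated with a small comparison: Integer.toHexString() does not zero-pad, so a character such as Ä (U+00C4) would yield only two hex digits, which is not a valid escape. The class and method names below are invented for this sketch.

```java
// Hypothetical comparison, not part of the original answer.
class PaddingDemo {
    // Unpadded: the number of hex digits depends on the value of c.
    static String unpadded(char c) {
        return "\\u" + Integer.toHexString(c);
    }

    // Padded: always exactly four hex digits, as the escape syntax requires.
    static String padded(char c) {
        return "\\u" + String.format("%04X", (int) c);
    }

    public static void main(String[] args) {
        char aUmlaut = (char) 0x00C4;
        System.out.println(unpadded(aUmlaut));
        System.out.println(padded(aUmlaut));
    }
}
```

    For U+00C4 the first call gives \uc4 and the second gives \u00C4; only the second can be pasted back into Java source or a .properties file.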

  • 2020-11-29 03:16

    There is an open-source Java library, MgntUtils, that has a utility for converting Strings to Unicode sequences and vice versa:

    String result = "Hello World";
    result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
    System.out.println(result);
    result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
    System.out.println(result);
    

    The output of this code is:

    \u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
    Hello World
    

    The library can be found at Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.

    See the Javadoc for the class StringUnicodeEncoderDecoder for details.
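
    For comparison, the decoding direction can also be sketched with the JDK alone. This is not how MgntUtils implements it; it is a minimal regex-based sketch that assumes well-formed escapes of exactly four hex digits, and the class name UnicodeUnescaper is invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, unrelated to the MgntUtils implementation.
class UnicodeUnescaper {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every matched escape with the char it denotes; other text is kept as-is.
    static String decode(String escaped) {
        Matcher m = ESCAPE.matcher(escaped);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against decoded '$' or backslash characters.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("\\u0048\\u0065\\u006c\\u006c\\u006f"));
    }
}
```

    Applied to the escaped sequence shown above, decode() returns Hello World.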
