How to remove high-ASCII characters like ®, ©, ™ from a string in Java

庸人自扰 2020-12-06 05:52

I want to detect and remove high-ASCII characters like ®, ©, ™ from a String in Java. Is there any open-source library that can do this?

4 answers
  • 2020-12-06 06:38

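    A nice way to do this is to use Google Guava's CharMatcher. In current Guava releases the matcher comes from the static ascii() method; the older CharMatcher.ASCII constant has been removed:

    String newString = CharMatcher.ascii().retainFrom(string);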
    

    newString will contain only the ASCII characters (code point < 128) from the original string.

    This reads more naturally than a regular expression. Regular expressions can take more effort to understand for subsequent readers of your code.
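If pulling in Guava is not an option, roughly the same result can be had with the JDK alone. The sketch below is only an illustration, not part of the answer above; the class and method names (AsciiOnly, retainAscii) are made up for this example. It walks the string by code point and keeps only values below 128.

```java
// Hypothetical helper; the names AsciiOnly and retainAscii are assumptions for this sketch.
final class AsciiOnly {
    private AsciiOnly() {}

    /** Keeps only code points below 128 and drops everything else. */
    static String retainAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> cp < 128)          // ASCII range is 0..127
             .forEach(out::appendCodePoint);  // rebuild the string from the survivors
        return out.toString();
    }
}
```

With this, a string such as "Java® ©2020 ™" should come back as "Java 2020 " (the spaces next to the removed symbols stay in place).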

  • 2020-12-06 06:53

    I understand that you want to delete characters such as ç, ã, Ã, but for anyone who needs to convert ç, ã, Ã to c, a, A instead, have a look at this piece of code:

    Example Code:

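    // Requires: import java.text.Normalizer;
    final String input = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";
    System.out.println(
        Normalizer
            // NFD splits each accented letter into a base letter plus combining marks
            .normalize(input, Normalizer.Form.NFD)
            // the combining marks are non-ASCII, so this removes them and keeps the base letters
            .replaceAll("[^\\p{ASCII}]", "")
    );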
    

    Output:

    This is a funky String
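The same idea can be wrapped in a small reusable method. This is a minimal sketch; the class and method names (AccentStripper, toAscii) are hypothetical and not from the answer above. It also makes the behaviour for the symbols in the question visible: ® and © have no canonical decomposition, so they are simply dropped rather than converted.

```java
import java.text.Normalizer;

// Hypothetical wrapper; AccentStripper and toAscii are names assumed for this sketch.
final class AccentStripper {
    private AccentStripper() {}

    /** Decomposes accented letters (NFD), then removes every non-ASCII character. */
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }
}
```

For example, "Crème brûlée®" should become "Creme brulee". Letters that do not decompose into a base letter plus marks, such as ø, are removed entirely, so "Søren" would come out as "Sren".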

  • 2020-12-06 06:55

    If you need to remove all non-US-ASCII (i.e. outside 0x0-0x7F) characters, you can do something like this:

    s = s.replaceAll("[^\\x00-\\x7f]", "");
    

    If you need to filter many strings, it would be better to use a precompiled pattern:

    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7f]");
    ...
    s = NON_ASCII.matcher(s).replaceAll("");
    

    And if it's really performance-critical, perhaps Alex Nikolaenkov's suggestion would be better.
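For the "many strings" case, the precompiled pattern might be packaged as follows. This is a sketch under assumptions: the class name NonAsciiRemover and the methods strip and stripAll are invented for illustration. The pattern is compiled once and shared, which is safe because Pattern instances are immutable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical utility; NonAsciiRemover, strip and stripAll are assumed names.
final class NonAsciiRemover {
    // Compiled once and reused for every call.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7f]");

    private NonAsciiRemover() {}

    /** Removes every character outside 0x00-0x7F from a single string. */
    static String strip(String s) {
        return NON_ASCII.matcher(s).replaceAll("");
    }

    /** Applies strip to each element and returns the results in a new list. */
    static List<String> stripAll(List<String> inputs) {
        List<String> result = new ArrayList<>(inputs.size());
        for (String s : inputs) {
            result.add(strip(s));
        }
        return result;
    }
}
```

Unlike a filter restricted to printable characters, this range keeps ASCII control characters such as tab and newline.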

  • 2020-12-06 06:55

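    I think you can easily filter the string by hand by checking the code of each character. If a character meets your requirements, append it to a StringBuilder and call toString() on it at the end. Note that the range used here (0x20 to 0x7e) keeps printable ASCII only, so control characters such as tab and newline are removed as well.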

    public static String filter(String str) {
        StringBuilder filtered = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char current = str.charAt(i);
            if (current >= 0x20 && current <= 0x7e) {
                filtered.append(current);
            }
        }
    
        return filtered.toString();
    }
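If tabs and line breaks should survive, the condition can be widened. The variant below is only a sketch of that adjustment; the class and method names (PrintableAsciiFilter, filterKeepingWhitespace) are hypothetical, and the choice of which control characters to keep is an assumption that depends on your requirements.

```java
// Hypothetical variant of the hand-written filter; names are assumed for this sketch.
final class PrintableAsciiFilter {
    private PrintableAsciiFilter() {}

    /** Keeps printable ASCII (0x20-0x7e) plus tab, line feed and carriage return. */
    static String filterKeepingWhitespace(String str) {
        StringBuilder filtered = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char current = str.charAt(i);
            boolean printable = current >= 0x20 && current <= 0x7e;
            boolean layout = current == '\t' || current == '\n' || current == '\r';
            if (printable || layout) {
                filtered.append(current);
            }
        }
        return filtered.toString();
    }
}
```

Characters outside the Basic Multilingual Plane are stored as surrogate pairs, and both halves fall outside the kept range, so they are dropped without any special handling.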
    