Remove all non-“word characters” from a String in Java, leaving accented characters?

被撕碎了的回忆 2020-11-28 02:18

Apparently Java's regex flavor counts umlauts and other special characters as non-"word characters" when I use regex.

        "TESTÜTEST".replaceAll( "

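The snippet above is cut off, so the following is only a minimal sketch of the behaviour described, assuming the pattern in question is `\W`. The class and method names (`WordCharDemo`, `stripAsciiNonWord`, `stripNonLetters`, `stripUnicodeNonWord`) are made up for illustration. By default Java's `\w` covers only `[a-zA-Z_0-9]`, so two Unicode-aware alternatives are shown next to it.

```java
import java.util.regex.Pattern;

final class WordCharDemo {
    // Default behaviour: \W matches everything outside [a-zA-Z_0-9],
    // so accented letters are removed as well.
    static String stripAsciiNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    // Keep letters and decimal digits from any script, drop the rest.
    // Unlike \w, this character class does not keep the underscore.
    static String stripNonLetters(String s) {
        return s.replaceAll("[^\\p{L}\\p{Nd}]+", "");
    }

    // Alternative (Java 7+): make \W itself Unicode-aware via a flag.
    private static final Pattern UNICODE_NON_WORD =
            Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS);

    static String stripUnicodeNonWord(String s) {
        return UNICODE_NON_WORD.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // U+00DC is the capital U with diaeresis; escaped to avoid
        // depending on the source file encoding.
        String input = "TEST\u00DCTEST!";
        System.out.println(stripAsciiNonWord(input));   // TESTTEST
        System.out.println(stripNonLetters(input));     // U+00DC is kept, "!" is removed
        System.out.println(stripUnicodeNonWord(input)); // same result as the line above
    }
}
```

The `\p{L}\p{Nd}` variant is the stricter of the two, since it also drops `_`; the flag variant keeps the usual meaning of `\w` and only widens it to Unicode.
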
5 Answers
  •  暖寄归人
    2020-11-28 03:01

    At times you do not want to simply remove the characters, but just remove the accents. I came up with the following utility class which I use in my Java REST web projects whenever I need to include a String in a URL:

    import java.text.Normalizer;
    import java.text.Normalizer.Form;
    
    import org.apache.commons.lang.StringUtils;
    
    /**
     * Utility class for String manipulation.
     * 
     * @author Stefan Haberl
     */
    public abstract class TextUtils {
        private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
        private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue",
                "sz" };
    
        /**
         * Normalizes a String by removing all accents to original 127 US-ASCII
         * characters. This method handles German umlauts and "sharp-s" correctly
         * 
         * @param s
         *            The String to normalize
         * @return The normalized String
         */
        public static String normalize(String s) {
            if (s == null)
                return null;
    
            String n = null;
    
            n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
            n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");
    
            return n;
        }
    
        /**
         * Returns a clean representation of a String which might be used safely
         * within an URL. Slugs are a more human friendly form of URL encoding a
         * String.
         * 
         * The method first normalizes a String, then converts it to lowercase and
         * removes ASCII characters, which might be problematic in URLs:
         * <ul>
         * <li>all whitespaces
         * <li>dots ('.')
         * <li>(semi-)colons (';' and ':')
         * <li>equals ('=')
         * <li>ampersands ('&amp;')
         * <li>slashes ('/')
         * <li>angle brackets ('&lt;' and '&gt;')
         * </ul>
         * 
         * @param s
         *            The String to slugify
         * @return The slugified String
         * @see #normalize(String)
         */
        public static String slugify(String s) {
            if (s == null)
                return null;
    
            String n = normalize(s);
            n = StringUtils.lowerCase(n);
            n = n.replaceAll("[\\s.:;&=<>/]", "");
    
            return n;
        }
    }

    Being a German speaker I've included proper handling of German umlauts as well - the list should be easy to extend for other languages.
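
To see what the two methods do without pulling in Commons Lang, here is a rough JDK-only sketch of the same idea. It is not the class above: `SlugSketch` and the `GERMAN` table are hypothetical names, plain `String.replace` stands in for `StringUtils.replaceEachRepeatedly`, and `toLowerCase(Locale.ROOT)` stands in for `StringUtils.lowerCase`.

```java
import java.text.Normalizer;
import java.util.Locale;

final class SlugSketch {
    // German special cases, written as Unicode escapes:
    // A/O/U with diaeresis (upper and lower case) and sharp s.
    private static final String[][] GERMAN = {
        {"\u00C4", "Ae"}, {"\u00E4", "ae"},
        {"\u00D6", "Oe"}, {"\u00F6", "oe"},
        {"\u00DC", "Ue"}, {"\u00FC", "ue"},
        {"\u00DF", "sz"}
    };

    static String normalize(String s) {
        if (s == null) {
            return null;
        }
        String n = s;
        // Apply the language-specific replacements first ...
        for (String[] pair : GERMAN) {
            n = n.replace(pair[0], pair[1]);
        }
        // ... then decompose what is left (NFD) and drop every non-ASCII
        // character, which removes the combining accent marks.
        return Normalizer.normalize(n, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    static String slugify(String s) {
        if (s == null) {
            return null;
        }
        return normalize(s)
                .toLowerCase(Locale.ROOT)
                .replaceAll("[\\s.:;&=<>/]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("TEST\u00DCTEST"));                 // TESTUeTEST
        System.out.println(normalize("Cr\u00E8me br\u00FBl\u00E9e"));    // Creme brulee
        System.out.println(slugify("\u00DCber uns: Caf\u00E9 & Bar"));   // ueberunscafebar
    }
}
```

One thing to keep in mind when extending the list: letters that have no canonical decomposition, such as `ø` (U+00F8), are not split into base letter plus accent by NFD, so they are dropped entirely unless they get their own entry in the replacement table.
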

    HTH

    EDIT: Note that it may be unsafe to include the returned String in a URL. You should at least HTML-encode it to prevent XSS attacks.
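
As a rough illustration of that last point: the character class in `slugify` leaves ASCII characters such as `"`, `'`, `?`, `#` and `%` untouched, so the result still needs encoding for the context it ends up in. The sketch below assumes Java 10+ (for the `URLEncoder.encode(String, Charset)` overload); `OutputEncodingSketch` and `escapeHtml` are hypothetical names, and the hand-written escaper is only a stand-in for whatever your web framework or templating library provides.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class OutputEncodingSketch {
    // Form/query-string encoding (application/x-www-form-urlencoded):
    // spaces become '+', reserved characters become %XX.
    static String urlEncode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Minimal HTML escaper for text and quoted attribute values.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(urlEncode("ueber uns"));          // ueber+uns
        System.out.println(urlEncode("\"><script>"));        // %22%3E%3Cscript%3E
        System.out.println(escapeHtml("<a href=\"x\">"));    // &lt;a href=&quot;x&quot;&gt;
    }
}
```

Which of the two applies depends on where the String is written: `urlEncode` when building a query parameter, `escapeHtml` when the URL or slug is echoed into an HTML page.
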
