Java regex match characters outside Basic Multilingual Plane

Submitted by 我们两清 on 2019-12-30 03:46:12

Question


How can I match characters from outside the Unicode Basic Multilingual Plane in Java, with the intention of removing them?


Answer 1:


To remove all non-BMP characters, the following should work, because java.util.regex matches by code point rather than by individual UTF-16 char:

String sanitizedString = inputString.replaceAll("[^\u0000-\uFFFF]", "");
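
As a quick check of the one-liner above, here is a minimal sketch that wraps it in a helper. The class name NonBmpRegexDemo and the method name stripNonBmp are made up for illustration. The character class is written with doubled backslashes so that the regex engine, rather than the Java compiler, interprets the escapes; this should behave the same way as the original.

```java
// Hypothetical helper class; the names are illustrative and not from the original answer.
class NonBmpRegexDemo {
    // Removes every code point above U+FFFF. java.util.regex matches by code point,
    // so a surrogate pair is treated as a single character here.
    static String stripNonBmp(String input) {
        return input.replaceAll("[^\\u0000-\\uFFFF]", "");
    }

    public static void main(String[] args) {
        // U+1F600 is outside the BMP and is encoded as the surrogate pair D83D DE00.
        String input = "a\uD83D\uDE00b";
        System.out.println(stripNonBmp(input)); // prints "ab"
    }
}
```

Strings that contain only BMP characters pass through unchanged.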



Answer 2:


Are you looking for specific characters or all characters outside the BMP?

If the former, you can use a StringBuilder to construct a string containing code points from the higher planes, and regex will work as expected:

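  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // U+10300 lies outside the BMP, so it occupies two chars (a surrogate pair).
  String test = new StringBuilder().append("test").appendCodePoint(0x10300).append("test").toString();
  Pattern regex = Pattern.compile(new StringBuilder().appendCodePoint(0x10300).toString());

  Matcher matcher = regex.matcher(test);
  if (matcher.find())   // start() throws IllegalStateException if nothing matched
  {
     System.out.println(matcher.start());   // prints 4
  }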
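
The same lookup can probably be written without building the pattern through a StringBuilder. The sketch below assumes Java 7 or later, where the regex syntax accepts a code point in \x{...} hex notation. The class name CodePointFinder and the method name indexOfCodePoint are hypothetical and not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative and not from the original answer.
class CodePointFinder {
    // Returns the char index where the given code point first occurs, or -1 if absent.
    static int indexOfCodePoint(String text, int codePoint) {
        // Builds a pattern such as \x{10300}; this hex notation needs Java 7 or later.
        Pattern pattern = Pattern.compile(String.format("\\x{%X}", codePoint));
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        String test = new StringBuilder().append("test").appendCodePoint(0x10300).append("test").toString();
        System.out.println(indexOfCodePoint(test, 0x10300));   // prints 4
        System.out.println(indexOfCodePoint("test", 0x10300)); // prints -1
    }
}
```

Returning -1 instead of calling start() unconditionally avoids the exception that start() throws when no match was found.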

If you're looking to remove all non-BMP characters from a string, then I'd use StringBuilder directly rather than regex:

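  StringBuilder sb = new StringBuilder(test.length());
  for (int ii = 0 ; ii < test.length() ; )
  {
     int codePoint = test.codePointAt(ii);
     if (codePoint > 0xFFFF)
     {
        // Supplementary code point: skip both of its UTF-16 code units.
        ii += Character.charCount(codePoint);
     }
     else
     {
        // BMP code point: exactly one char, so keep it and advance by one.
        sb.appendCodePoint(codePoint);
        ii++;
     }
  }
  String sanitized = sb.toString();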
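
On Java 8 or later, the same filtering could also be expressed with String.codePoints(). This is a sketch of an alternative, not the original answer's method; the class name NonBmpStreamDemo and the method name stripNonBmp are hypothetical.

```java
// Hypothetical helper class; the names are illustrative and not from the original answer.
class NonBmpStreamDemo {
    // Keeps only code points in the BMP (U+0000 to U+FFFF). Requires Java 8 or later.
    static String stripNonBmp(String input) {
        return input.codePoints()
                .filter(Character::isBmpCodePoint)
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String test = new StringBuilder().append("test").appendCodePoint(0x10300).append("test").toString();
        System.out.println(stripNonBmp(test)); // prints "testtest"
    }
}
```

Like the explicit loop, this keeps unpaired surrogates, because codePoints() reports them as ordinary values below 0x10000.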


Source: https://stackoverflow.com/questions/4035562/java-regex-match-characters-outside-basic-multilingual-plane
