Given mixed accented and normal characters in string not working in java when searching

為{幸葍}努か submitted on 2019-12-05 02:03:48

As you said,

String text = "Cámélan discovered ônte red aleŕt \n Como se extingue la deuda";

So whenever the user types something, take the first character of that input and compare on it.

E.g., if you type Ca, then:

if (StringUtils.isNotEmpty(substring)) { // substring is the search text
    substring = substring.substring(0, 1); // keep only the first character, here "C"
}

So whatever you type, the content is filtered by its first character. Now:

SpannableString builder = higlightString(yourContent.getText(), mHighlightColor);
viewHolder.mTextView.setText(builder);




private SpannableString higlightString(String entireContent, int highlightColor) {
    SpannableString returnValue = new SpannableString(entireContent);
    String lowercaseNodeText = entireContent;
    try {
        // Match against a lower-cased, accent-stripped copy of the text.
        Matcher matcher = mFilter.getPattern().matcher(diacritical(lowercaseNodeText.toLowerCase()));
        while (matcher.find()) {
            // Apply the match positions as a colour span on the original text.
            returnValue.setSpan(new ForegroundColorSpan(highlightColor), matcher.start(0),
                    matcher.end(0), Spannable.SPAN_EXCLUSIVE_INCLUSIVE);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return returnValue;
}



private String diacritical(String original) {
    String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

Test case:

When you type Ca, the filter runs over the entire text and collects everything containing C. The content is normalised before matching, so accented characters match too, and the matches are displayed highlighted.
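The same idea can be tried outside Android. The following is a minimal sketch, not the answer's actual code: the class name `AccentInsensitiveFind` and the method `findRanges` are made up for illustration, and the search text is treated as a literal string rather than a pattern. It assumes the original text is in composed form, so that positions in the stripped copy line up with positions in the original.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class AccentInsensitiveFind {

    // Decompose to NFD and drop the combining diacritical marks.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Returns {start, end} index pairs of every match of query in content,
    // ignoring accents and case. Indices refer to the stripped copy; they
    // match the original only if the original has one char per letter.
    static List<int[]> findRanges(String content, String query) {
        String haystack = stripAccents(content).toLowerCase(Locale.ROOT);
        String needle = stripAccents(query).toLowerCase(Locale.ROOT);
        List<int[]> ranges = new ArrayList<>();
        if (needle.isEmpty()) {
            return ranges;
        }
        // Pattern.quote treats the search text literally (an assumption).
        Matcher m = Pattern.compile(Pattern.quote(needle)).matcher(haystack);
        while (m.find()) {
            ranges.add(new int[] {m.start(), m.end()});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // "Cámélan discovered ônte red aleŕt", written with escapes so the
        // source file encoding does not matter.
        String text = "C\u00e1m\u00e9lan discovered \u00f4nte red ale\u0155t";
        for (int[] r : findRanges(text, "ca")) {
            System.out.println(r[0] + "-" + r[1] + ": " + text.substring(r[0], r[1]));
        }
    }
}
```

On Android, each returned pair could be passed to `setSpan` in place of `matcher.start(0)` and `matcher.end(0)`.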

You already got:

private String convertToBasicLatin(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFD)
        .replaceAll("\\p{M}", "").replaceAll("\\R", "\n");
}

In order to have one unaccented basic Latin char match one Unicode code point of an accented letter, one should normalize the content to the composed form:

private String convertToComposedCodePoints(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFC).replaceAll("\\R", "\n");
}

In general one might make the assumption that the Unicode code point is 1 char long too.

  • The search key uses convertToBasicLatin(sought)
  • The text view's content uses convertToComposedCodePoints(content)
  • The text content for matching uses convertToBasicLatin(content)

Now the matcher's start and end index positions are correct. I explicitly normalized line endings (regex \R) like \r\n or \u0085 to a single \n. One cannot normalize to lowercase/uppercase, as the number of chars might vary: German lowercase ß corresponds to uppercase SS.
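The ß point is easy to check. A small sketch (the class name `CaseLengthDemo` is made up for illustration) showing that upper-casing can change the string length, which would shift every later index:

```java
import java.util.Locale;

// Hypothetical demo class, for illustration only.
class CaseLengthDemo {

    // Length of the string after upper-casing with a fixed locale.
    static int upperLength(String s) {
        return s.toUpperCase(Locale.ROOT).length();
    }

    public static void main(String[] args) {
        String word = "stra\u00dfe"; // "straße", 6 chars
        // Upper-casing turns ß into SS, so the result has 7 chars.
        System.out.println(word.length() + " -> " + upperLength(word));
    }
}
```

This is why the matching is done with case-insensitive pattern flags rather than by converting the strings first.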

String sought = ...;
String content = ...;

sought = convertToBasicLatin(sought);
String latinContent = convertToBasicLatin(content);
String composedContent = convertToComposedCodePoints(content);

Matcher m = Pattern.compile(sought, Pattern.CASE_INSENSITIVE
        | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS
        | Pattern.UNIX_LINES)
    .matcher(latinContent);
while (m.find()) {
    ... // One can apply `m.start()` and `m.end()` to composedContent of the view too.
}
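Put together, the approach might look like the sketch below. It is an illustration under assumptions, not a definitive implementation: the class name `ComposedHighlight` and the method `mark` are made up, the search key is quoted so that it is matched literally, and brackets stand in for whatever highlighting the view would apply.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, for illustration only.
class ComposedHighlight {

    // NFD, strip all marks, unify line endings: one char per base letter.
    static String toBasicLatin(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "").replaceAll("\\R", "\n");
    }

    // NFC, unify line endings: one char per accented letter where a
    // precomposed code point exists.
    static String toComposed(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC).replaceAll("\\R", "\n");
    }

    // Wraps every accent-insensitive match in [ ] inside the composed text.
    static String mark(String content, String sought) {
        String composed = toComposed(content);
        String key = toBasicLatin(sought);
        if (key.isEmpty()) {
            return composed;
        }
        // Pattern.quote is an assumption: the key is treated as literal text.
        Matcher m = Pattern.compile(Pattern.quote(key),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                .matcher(toBasicLatin(content));
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Indices found in the basic Latin copy are applied to the composed copy.
            out.append(composed, last, m.start())
               .append('[').append(composed, m.start(), m.end()).append(']');
            last = m.end();
        }
        return out.append(composed.substring(last)).toString();
    }

    public static void main(String[] args) {
        // Input given in decomposed form on purpose: a + U+0301, e + U+0301.
        System.out.println(mark("Ca\u0301me\u0301lan discovered", "came"));
    }
}
```

The index transfer only holds while both copies have the same length, which is the one-char-per-code-point assumption mentioned above.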

I'm not a Java programmer, so just some basic raw regex solution here.

If you can normalize the string to its decomposed form, assume it's this:

String sSourceTargetDecom = Normalizer.normalize(sourcetarget, Normalizer.Form.NFD);

That should turn something like U+00C1 Á LATIN CAPITAL LETTER A WITH ACUTE
into two characters: A and U+0301 COMBINING ACUTE ACCENT.
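A quick way to see the decomposition in Java is to print the code points before and after normalizing. This is a small sketch; the class name `DecomposeDemo` and the helper `codePoints` are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class, for illustration only.
class DecomposeDemo {

    // Formats every code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String composed = "\u00c1"; // LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);
        System.out.println(codePoints(composed));   // U+00C1
        System.out.println(codePoints(decomposed)); // U+0041 U+0301
    }
}
```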

You can get most combining characters from blocks using

[\p{Block=Combining_Diacritical_Marks}\p{Block=Combining_Diacritical_Marks_Extended}\p{Block=Combining_Diacritical_Marks_For_Symbols}\p{Block=Combining_Diacritical_Marks_Supplement}\p{Block=Combining_Half_Marks}]  

which has a hex range of

[\x{300}-\x{36f}\x{1ab0}-\x{1aff}\x{1dc0}-\x{1dff}\x{20d0}-\x{20ff}\x{fe20}-\x{fe2f}]  

It turns out that most of the combining marks relative to basic Latin that can be
decomposed are in the [\x{300}-\x{36f}] range.

You can decompose both the source target and the input search string.


Then create a regex from the input search string. Inject [\x{300}-\x{36f}]? after each basic Latin letter.

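String regex = sSearch.replaceAll("([a-zA-Z])[\\x{300}-\\x{36f}]?", "$1[\\\\x{300}-\\\\x{36f}]?");

(Java writes the group reference in the replacement string as $1 rather than \1, and a literal backslash there has to be escaped. Its regex engine accepts the \x{h...h} code point notation since Java 7, as well as \uhhhh.)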

Then use the regex on the sSourceTargetDecom string, it will match the basic latin as a stand alone, and/or with an optional combining code.
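A Java sketch of this regex approach follows. It is an illustration under assumptions: the class name `CombiningMarkRegex` and its methods are made up, and it builds the pattern in a loop instead of a single `replaceAll` so that characters other than basic Latin letters can be quoted. Combining marks typed in the search string are dropped, since the injected optional class covers them.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, for illustration only.
class CombiningMarkRegex {

    // Optional combining mark from the U+0300..U+036F block.
    static final String OPTIONAL_MARK = "[\\x{300}-\\x{36f}]?";

    // Builds a pattern that lets every basic Latin letter of the search
    // string be followed by an optional combining mark.
    static Pattern buildPattern(String search) {
        String decomposed = Normalizer.normalize(search, Normalizer.Form.NFD);
        StringBuilder regex = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            if (cp >= 0x300 && cp <= 0x36f) {
                return; // mark in the search text: already optional in the pattern
            }
            String ch = new String(Character.toChars(cp));
            boolean basicLetter = (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
            regex.append(basicLetter ? ch + OPTIONAL_MARK : Pattern.quote(ch));
        });
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Returns the first match, recomposed to NFC, or null if there is none.
    static String firstMatch(String sourceTarget, String search) {
        String decomposedTarget = Normalizer.normalize(sourceTarget, Normalizer.Form.NFD);
        Matcher m = buildPattern(search).matcher(decomposedTarget);
        return m.find() ? Normalizer.normalize(m.group(), Normalizer.Form.NFC) : null;
    }

    public static void main(String[] args) {
        // "Cámélan discovered", written with escapes.
        System.out.println(firstMatch("C\u00e1m\u00e9lan discovered", "came"));
    }
}
```

Note that `m.start()` and `m.end()` here index into the decomposed string, so they cannot be applied directly to the original text when it contains precomposed letters.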
