Searching a string with mixed accented and plain characters does not work in Java

Question

String text = "Cámélan discovered ônte red aleŕt \n Como se extingue la deuda";

If I give the input Ca, it should highlight Cá in the given string, but it does not.

Below is what I tried.

Pattern mPattern;
String filterTerm; // the text typed into the input filter, e.g. Ca
String regex = createFilterRegex(filterTerm);
mPattern = Pattern.compile(regex);

private String createFilterRegex(String filterTerm) {
    filterTerm = Normalizer.normalize(filterTerm, Normalizer.Form.NFD);
    filterTerm = filterTerm.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    return filterTerm;
}

public Pattern getPattern() {
    return mPattern;
}

In another class,

// nodeText is the full text of the list entry being displayed.
private SpannableStringBuilder createHighlightedString(String nodeText, int highlightColor) {
    SpannableStringBuilder returnValue = new SpannableStringBuilder(nodeText);
    String lowercaseNodeText = nodeText;
    Matcher matcher = mFilter.getPattern().matcher(createFilterRegex(lowercaseNodeText));
    while (matcher.find()) {
        returnValue.setSpan(new ForegroundColorSpan(highlightColor), matcher.start(0),
                matcher.end(0), Spannable.SPAN_EXCLUSIVE_INCLUSIVE);
    }
    return returnValue;
}

viewHolder.mTextView.setText(createHighlightedString((node.getText()), mHighlightColor));

But this is what actually happens:

If I type the single letter o, it is highlighted, but if I type two or more letters, for example Ca, nothing is highlighted or displayed. I can't figure out what I am doing wrong.

WhatsApp, however, manages this.

There, when I type Co, it recognizes and highlights the accented characters in the sentence.

Answer 1:

As you said,

String text = "Cámélan discovered ônte red aleŕt \n Como se extingue la deuda";

So whenever input is entered, take its first character and compare against that.

E.g. if you enter Ca, then

if (StringUtils.isNotEmpty(substring)) { // substring is the search text
    substring = substring.substring(0, 1); // now you have C alone
}

So whatever you type, the list is filtered by its first character. Now

SpannableString builder = higlightString(yourContent.getText(), mHighlightColor);
viewHolder.mTextView.setText(builder);

private SpannableString higlightString(String entireContent, int highlightColor) {
    SpannableString returnValue = new SpannableString(entireContent);
    String lowercaseNodeText = entireContent;
    try {
        Matcher matcher = mFilter.getPattern().matcher(diacritical(lowercaseNodeText.toLowerCase()));
        while (matcher.find()) {
            returnValue.setSpan(new ForegroundColorSpan(highlightColor), matcher.start(0),
                    matcher.end(0), Spannable.SPAN_EXCLUSIVE_INCLUSIVE);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return returnValue;
}

// Decompose to NFD, then drop the combining diacritical marks.
private String diacritical(String original) {
    String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

Test case:

When you enter Ca, the whole list is first filtered down to the entries containing C. The content is then normalized, so the pattern also matches accented characters, and the matches are highlighted.
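The normalization step above can be tried in isolation. The following is only a minimal sketch: the class name AccentFoldDemo and the helper containsFolded are hypothetical, and lower-casing both sides with Locale.ROOT is an assumption added here so that Ca and ca compare equal. It only answers "does the term occur?", so it says nothing about highlight positions.

```java
import java.text.Normalizer;
import java.util.Locale;

final class AccentFoldDemo {
    // Decompose to NFD, then drop the combining marks (U+0300..U+036F).
    static String diacritical(String original) {
        String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Hypothetical helper: true if term occurs in content, ignoring accents and case.
    static boolean containsFolded(String content, String term) {
        String foldedContent = diacritical(content).toLowerCase(Locale.ROOT);
        String foldedTerm = diacritical(term).toLowerCase(Locale.ROOT);
        return foldedContent.contains(foldedTerm);
    }

    public static void main(String[] args) {
        // "\u00e1" is á and "\u00e9" is é; escapes avoid source-encoding surprises.
        String word = "C\u00e1m\u00e9lan";
        System.out.println(diacritical(word));          // Camelan
        System.out.println(containsFolded(word, "Ca")); // true
    }
}
```

Note that the search term has to go through the same folding as the content; if only one side is lower-cased or stripped, a two-letter term such as Ca will not match.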

Answer 2:

You already have:

private String convertToBasicLatin(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFD)
        .replaceAll("\\p{M}", "").replaceAll("\\R", "\n");
}

In order to have one unaccented basic Latin char match one Unicode code point of an accented letter, one should normalize the text to the composed form:

private String convertToComposedCodePoints(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFC).replaceAll("\\R", "\n");
}

In general one may assume that such a composed code point is also just one char long.

The search key uses convertToBasicLatin(sought)
The text view's content uses convertToComposedCodePoints(content)
The text content for matching uses convertToBasicLatin(content)

Now the matcher's start and end index positions are correct. I explicitly normalized line endings (regex \R) such as \r\n or \u0085 to a single \n. One cannot normalize to lowercase/uppercase, as the number of chars might vary: German lowercase ß corresponds to uppercase SS.

String sought = ...;
String content = ...;

sought = convertToBasicLatin(sought);
String latinContent = convertToBasicLatin(content);
String composedContent = convertToComposedCodePoints(content);

Matcher m = Pattern.compile(Pattern.quote(sought), Pattern.CASE_INSENSITIVE
        | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS
        | Pattern.UNIX_LINES)
    .matcher(latinContent);
while (m.find()) {
    ... // One can apply `m.start()` and `m.end()` to composedContent of the view too.
}
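Put together, the approach might look like the sketch below. The class name HighlightRanges and the method findRanges are hypothetical. Two things are additions rather than part of the answer: the search key is wrapped in Pattern.quote so it is treated as literal text, and a coarse length check guards the assumption that every accented letter has a one-char composed form. \R requires Java 8 or later.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HighlightRanges {
    // NFD, then remove every combining mark; line breaks become a single \n.
    static String convertToBasicLatin(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "").replaceAll("\\R", "\n");
    }

    // NFC, so that an accented letter is (usually) a single char.
    static String convertToComposedCodePoints(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC).replaceAll("\\R", "\n");
    }

    // Returns {start, end} pairs that are valid indices into
    // convertToComposedCodePoints(content).
    static List<int[]> findRanges(String content, String sought) {
        List<int[]> ranges = new ArrayList<>();
        String latinContent = convertToBasicLatin(content);
        String composedContent = convertToComposedCodePoints(content);
        // Assumption: indices only line up if both forms have the same length.
        if (sought.isEmpty() || latinContent.length() != composedContent.length()) {
            return ranges;
        }
        Pattern pattern = Pattern.compile(Pattern.quote(convertToBasicLatin(sought)),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = pattern.matcher(latinContent);
        while (m.find()) {
            ranges.add(new int[] {m.start(), m.end()});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // "C\u00e1m\u00e9lan" is "Cámélan".
        for (int[] r : findRanges("C\u00e1m\u00e9lan discovered", "Ca")) {
            System.out.println(r[0] + ".." + r[1]); // 0..2
        }
    }
}
```

On Android, each returned pair could be passed to setSpan on a Spannable built from the composed content rather than from the raw input.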

Answer 3:

I'm not a Java programmer, so just some basic raw regex solution here.

If you can normalize the string to its decomposed form, say with

String sSourceTargetDecom = Normalizer.normalize(sourcetarget, Normalizer.Form.NFD);

that should turn something like U+00C1 Á LATIN CAPITAL LETTER A WITH ACUTE
into two characters: A and U+0301 COMBINING ACUTE ACCENT.
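A quick way to see the decomposition is to print the code units before and after. This is just an illustrative sketch; the class name DecomposeDemo and the helper toNfd are made up for the example.

```java
import java.text.Normalizer;

final class DecomposeDemo {
    static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00C1"; // LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = toNfd(composed);
        System.out.println(composed.length());   // 1
        System.out.println(decomposed.length()); // 2
        // Print the two code units of the decomposed form.
        System.out.printf("U+%04X U+%04X%n",
                (int) decomposed.charAt(0), (int) decomposed.charAt(1)); // U+0041 U+0301
    }
}
```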

You can get most combining characters from blocks using

[\p{Block=Combining_Diacritical_Marks}\p{Block=Combining_Diacritical_Marks_Extended}\p{Block=Combining_Diacritical_Marks_For_Symbols}\p{Block=Combining_Diacritical_Marks_Supplement}\p{Block=Combining_Half_Marks}]

which has a hex range of

[\x{300}-\x{36f}\x{1ab0}-\x{1aff}\x{1dc0}-\x{1dff}\x{20d0}-\x{20ff}\x{fe20}-\x{fe2f}]

It turns out that most of the combining marks relative to basic Latin that can be
decomposed are in the [\x{300}-\x{36f}] range.

You can decompose both the source target and the input search string.

Then create a regex from the input search string. Inject [\x{300}-\x{36f}]? after each basic Latin letter.

String regex = sSearch.replaceAll("([a-zA-Z])[\\x{300}-\\x{36f}]?", "$1[\\\\x{300}-\\\\x{36f}]?");

(In Java the replacement string refers to the group as $1, and a literal backslash in it has to be doubled. Java 7 and later accept the \x{h...h} code point notation in patterns.)

Then use the regex on the sSourceTargetDecom string; it will match each basic Latin letter either on its own or followed by an optional combining mark.
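A minimal Java sketch of this idea follows. It builds the pattern in a loop instead of with replaceAll, which sidesteps the replacement-string escaping; the class name OptionalMarkRegex, the method buildPattern, and the CASE_INSENSITIVE flag are assumptions added here, not part of the answer above.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OptionalMarkRegex {
    // Combining Diacritical Marks block, U+0300..U+036F.
    private static final String MARK = "[\\x{300}-\\x{36f}]";

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Each basic Latin letter of the search term may be followed by one mark.
    static Pattern buildPattern(String search) {
        String base = nfd(search).replaceAll(MARK, "");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < base.length(); i++) {
            char c = base.charAt(i);
            regex.append(Pattern.quote(String.valueOf(c)));
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                regex.append(MARK).append('?');
            }
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        String decomposed = nfd("C\u00e1m\u00e9lan"); // "Cámélan" in NFD
        Matcher m = buildPattern("Ca").matcher(decomposed);
        if (m.find()) {
            System.out.println(m.start() + ".." + m.end()); // 0..3
        }
    }
}
```

The match positions refer to the decomposed string, which is longer than the composed one, so any highlighting should be applied to the decomposed text or the indices mapped back.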

Source: https://stackoverflow.com/questions/52835775/given-mixed-accented-and-normal-characters-in-string-not-working-in-java-when-se

Tags

java

android

regex

pattern-matching

matcher