Given mixed accented and normal characters in string not working in java when searching

Submitted by 限于喜欢 on 2019-12-22 04:09:16

Question


String text = "Cámélan discovered ônte red aleŕt \n Como se extingue la deuda";

If I enter Ca as the input, it should highlight Cá in the given string, but nothing is highlighted.

Below is what I tried.

Pattern mPattern;
String filterTerm; // the input coming from the filter field, e.g. Ca
String regex = createFilterRegex(filterTerm);
mPattern = Pattern.compile(regex);

private String createFilterRegex(String filterTerm) {
    filterTerm = Normalizer.normalize(filterTerm, Normalizer.Form.NFD);
    filterTerm = filterTerm.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    return filterTerm;
}

public Pattern getPattern() {
    return mPattern;
}

In another class,

// nodeText is the text of the list entry being displayed.
private SpannableStringBuilder createHighlightedString(String nodeText, int highlightColor) {
    SpannableStringBuilder returnValue = new SpannableStringBuilder(nodeText);
    String lowercaseNodeText = nodeText;
    Matcher matcher = mFilter.getPattern().matcher(createFilterRegex(lowercaseNodeText));
    while (matcher.find()) {
        returnValue.setSpan(new ForegroundColorSpan(highlightColor), matcher.start(0),
                matcher.end(0), Spannable.SPAN_EXCLUSIVE_INCLUSIVE);
    }

    return returnValue;
}

viewHolder.mTextView.setText(createHighlightedString((node.getText()), mHighlightColor));

But this is the output I get: if I type a single letter such as o, it is highlighted, but if I type two or more letters, e.g. Ca, nothing is highlighted or displayed. I can't figure out what mistake I am making.

WhatsApp, however, manages to do this.

When I type Co there, it recognizes and highlights the accented characters in the sentence.


Answer 1:


As you said,

String text = "Cámélan discovered ônte red aleŕt \n Como se extingue la deuda";

So whenever input is entered, take its first character and compare with that.

E.g. if you enter Ca, then

if (StringUtils.isNotEmpty(substring)) { // substring is the search text
    substring = substring.substring(0, 1); // now you have C alone
}

So whatever you type, the list is filtered by the first character. Now

SpannableString builder = highlightString(yourContent.getText(), mHighlightColor);
viewHolder.mTextView.setText(builder);

private SpannableString highlightString(String entireContent, int highlightColor) {
    SpannableString returnValue = new SpannableString(entireContent);

    String lowercaseNodeText = entireContent;
    try {
        // Match against the lower-cased content with the diacritics stripped.
        Matcher matcher = mFilter.getPattern().matcher(diacritical(lowercaseNodeText.toLowerCase()));
        while (matcher.find()) {
            returnValue.setSpan(new ForegroundColorSpan(highlightColor), matcher.start(0),
                    matcher.end(0), Spannable.SPAN_EXCLUSIVE_INCLUSIVE);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }

    return returnValue;
}

// Decomposes the string (NFD) and removes the combining diacritical marks.
private String diacritical(String original) {
    String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

Test case:

When you enter Ca, the entire text is searched: all entries containing C are fetched first, then the content is normalised and filtered, so the search term matches accented characters too and they are displayed highlighted.
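
The stripping and matching step described above can be tried outside Android with a small sketch like the following. The class name DiacriticFilterDemo and the method findRanges are made up for this illustration, the match ranges are returned as plain index pairs instead of being applied as spans, and Pattern.quote and CASE_INSENSITIVE are additions that are not in the answer's code. It also assumes the content is in composed form (one char per accented letter), so that indices in the stripped text line up with the original text.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DiacriticFilterDemo {

    // Same idea as diacritical() above: decompose (NFD), then drop the combining marks.
    public static String stripDiacritics(String original) {
        String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Returns {start, end} index pairs of every match of searchTerm in the stripped content.
    // Assumption: content is in composed form, so the indices also fit the original string.
    public static List<int[]> findRanges(String content, String searchTerm) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(stripDiacritics(searchTerm)), // treat the input as literal text
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(stripDiacritics(content));
        List<int[]> ranges = new ArrayList<>();
        while (matcher.find()) {
            ranges.add(new int[] {matcher.start(), matcher.end()});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // The question's sample text, written with escapes to guarantee composed characters.
        String text = "C\u00e1m\u00e9lan discovered \u00f4nte red ale\u0155t \n Como se extingue la deuda";
        for (int[] range : findRanges(text, "Ca")) {
            // Each range would be the span to highlight in the original text.
            System.out.println(range[0] + "-" + range[1] + ": " + text.substring(range[0], range[1]));
        }
    }
}
```

On Android, each returned pair would be passed to setSpan as in the highlight method above.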




Answer 2:


You already got:

private String convertToBasicLatin(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFD)
        .replaceAll("\\p{M}", "").replaceAll("\\R", "\n");
}

In order to have one unaccented basic Latin char correspond to one Unicode code point of an accented letter, one should normalize the text to the composed form:

private String convertToComposedCodePoints(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFC).replaceAll("\\R", "\n");
}

In general one can assume that each such Unicode code point is also 1 char long.

  • The search key uses convertToBasicLatin(sought)
  • The text view's content uses convertToComposedCodePoints(content)
  • The text content for matching uses convertToBasicLatin(content)

Now the matcher's start and end index positions are correct. I explicitly normalized line endings (regex \R) such as \r\n or \u0085 to a single \n. One cannot normalize to lowercase/uppercase, as the number of chars might change: German lowercase ß corresponds to uppercase SS.

String sought = ...;
String content = ...;

sought = convertToBasicLatin(sought);
String latinContent = convertToBasicLatin(content);
String composedContent = convertToComposedCodePoints(content);

Matcher m = Pattern.compile(sought, Pattern.CASE_INSENSITIVE
        | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS
        | Pattern.UNIX_LINES)
    .matcher(latinContent);
while (m.find()) {
    ... // One can apply `m.start()` and `m.end()` to composedContent of the view too.
}
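
As a rough, self-contained sketch of the above (the class name ComposedMatchDemo and the helper firstMatch are invented for this example, and Pattern.quote is an addition so that the search key is taken literally), the following shows that indices found in the basic Latin content can be applied to the composed content, even when the input arrives in decomposed form:

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ComposedMatchDemo {

    public static String convertToBasicLatin(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "").replaceAll("\\R", "\n");
    }

    public static String convertToComposedCodePoints(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC).replaceAll("\\R", "\n");
    }

    // Returns the first match as it appears in the composed content, or null if there is none.
    public static String firstMatch(String content, String sought) {
        String latinContent = convertToBasicLatin(content);
        String composedContent = convertToComposedCodePoints(content);
        Matcher m = Pattern.compile(Pattern.quote(convertToBasicLatin(sought)),
                        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                .matcher(latinContent);
        // The indices come from latinContent but are applied to composedContent.
        return m.find() ? composedContent.substring(m.start(), m.end()) : null;
    }

    public static void main(String[] args) {
        // Content deliberately given in decomposed form: a + U+0301, e + U+0301.
        String content = "Ca\u0301me\u0301lan discovered";
        String match = firstMatch(content, "ca");
        System.out.println(match.length()); // 2: C plus the single composed char U+00E1

        // Why case is handled by the regex flags instead of toUpperCase/toLowerCase:
        System.out.println("\u00df".toUpperCase(Locale.ROOT)); // SS (two chars for one)
    }
}
```

The sketch relies on every accented letter in the content having a single-char composed form; letters with no precomposed equivalent would still shift the indices.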



Answer 3:


I'm not a Java programmer, so here is just a basic raw regex solution.

If you can normalize the string to its decomposed form, say with

String sSourceTargetDecom = Normalizer.normalize(sourcetarget, Normalizer.Form.NFD);

that should turn something like U+00C1 Á LATIN CAPITAL LETTER A WITH ACUTE
into two characters: A and U+0301 COMBINING ACUTE ACCENT.
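
A quick way to check this decomposition is a sketch like the one below (the class name DecompositionDemo and the helper decompose are only for illustration):

```java
import java.text.Normalizer;

public class DecompositionDemo {

    public static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00C1"; // LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = decompose(composed);

        System.out.println(composed.length());   // 1
        System.out.println(decomposed.length()); // 2

        // Prints U+0041 (A) followed by U+0301 (COMBINING ACUTE ACCENT).
        for (char c : decomposed.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```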

You can get most combining characters from blocks using

[\p{Block=Combining_Diacritical_Marks}\p{Block=Combining_Diacritical_Marks_Extended}\p{Block=Combining_Diacritical_Marks_For_Symbols}\p{Block=Combining_Diacritical_Marks_Supplement}\p{Block=Combining_Half_Marks}]  

which has a hex range of

[\x{300}-\x{36f}\x{1ab0}-\x{1aff}\x{1dc0}-\x{1dff}\x{20d0}-\x{20ff}\x{fe20}-\x{fe2f}]  

It turns out that most of the combining marks relative to basic Latin that can be
decomposed are in the [\x{300}-\x{36f}] range.

You can decompose both the source target and the input search string.


Then create a regex from the input search string. Inject [\x{300}-\x{36f}]? after each basic Latin letter.

String regex = sSearch.replaceAll("([a-zA-Z])[\\x{300}-\\x{36f}]?", "$1[\\\\x{300}-\\\\x{36f}]?");

(Java's regex accepts the \x{...} code point notation since Java 7. In a Java replacement string the group reference is written $1, and a literal backslash has to be doubled.)

Then use the regex on the sSourceTargetDecom string. It will match each basic Latin letter either standing alone or followed by an optional combining mark.
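
Put together in Java, the approach might look like the sketch below. The class name OptionalMarkRegexDemo and the methods buildRegex and matches are hypothetical, and the sketch does not escape regex metacharacters in the search string, so it assumes the input consists of letters, digits and spaces only. Match indices refer to the decomposed string, not the original one.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalMarkRegexDemo {

    // Character class for the Combining Diacritical Marks block (U+0300 to U+036F).
    private static final String MARK = "[\\x{300}-\\x{36f}]";

    // Decomposes the search string and lets every basic Latin letter
    // be followed by one optional combining mark.
    public static String buildRegex(String search) {
        String decomposed = Normalizer.normalize(search, Normalizer.Form.NFD);
        return decomposed.replaceAll(
                "([a-zA-Z])" + MARK + "?",
                "$1" + Matcher.quoteReplacement(MARK) + "?");
    }

    // True if the search string occurs in the target, ignoring those combining marks.
    public static boolean matches(String sourceTarget, String search) {
        String decomposedTarget = Normalizer.normalize(sourceTarget, Normalizer.Form.NFD);
        return Pattern.compile(buildRegex(search)).matcher(decomposedTarget).find();
    }

    public static void main(String[] args) {
        System.out.println(buildRegex("Ca"));
        System.out.println(matches("C\u00e1m\u00e9lan discovered", "Ca")); // true
        System.out.println(matches("Como se extingue", "Ca"));             // false
    }
}
```

Matcher.quoteReplacement is used here so that the backslashes in the character class survive the replacement step.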



Source: https://stackoverflow.com/questions/52835775/given-mixed-accented-and-normal-characters-in-string-not-working-in-java-when-se
