Remove ✅,

离开以前 2020-11-28 20:03

I have some strings with all kinds of different emojis/images/signs in them.

Not all the strings are in English -- some of them are in other non-Latin languages, for

7 answers
  • 2020-11-28 21:02

    I'm not very familiar with Java, so I won't try to write example code inline, but the way I would do this is to check what Unicode calls the "general category" of each character. There are several letter and punctuation categories.

    You can use Character.getType to find the general category of a given character. You should probably retain those characters that fall in these general categories:

    COMBINING_SPACING_MARK
    CONNECTOR_PUNCTUATION
    CURRENCY_SYMBOL
    DASH_PUNCTUATION
    DECIMAL_DIGIT_NUMBER
    ENCLOSING_MARK
    END_PUNCTUATION
    FINAL_QUOTE_PUNCTUATION
    FORMAT
    INITIAL_QUOTE_PUNCTUATION
    LETTER_NUMBER
    LINE_SEPARATOR
    LOWERCASE_LETTER
    MATH_SYMBOL
    MODIFIER_LETTER
    MODIFIER_SYMBOL
    NON_SPACING_MARK
    OTHER_LETTER
    OTHER_NUMBER
    OTHER_PUNCTUATION
    PARAGRAPH_SEPARATOR
    SPACE_SEPARATOR
    START_PUNCTUATION
    TITLECASE_LETTER
    UPPERCASE_LETTER
    

    (All of the characters you listed as specifically wanting to remove have general category OTHER_SYMBOL, which I did not include in the above category whitelist.)
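
For readers who want something concrete, here is a minimal sketch of the whitelist approach described above. The class name `EmojiFilter` and the method names `isKept` and `stripSymbols` are made up for illustration; only `Character.getType(int)` and the category constants come from the JDK. It iterates over code points rather than `char` values, on the assumption that the input may contain emoji outside the Basic Multilingual Plane, which Java stores as surrogate pairs.

```java
final class EmojiFilter {

    // Returns true if the code point's general category is in the
    // whitelist from the answer above. Everything else is dropped,
    // including OTHER_SYMBOL and also CONTROL (so '\n' and '\t' go too).
    static boolean isKept(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.COMBINING_SPACING_MARK:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.CURRENCY_SYMBOL:
            case Character.DASH_PUNCTUATION:
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.ENCLOSING_MARK:
            case Character.END_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.FORMAT:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.LETTER_NUMBER:
            case Character.LINE_SEPARATOR:
            case Character.LOWERCASE_LETTER:
            case Character.MATH_SYMBOL:
            case Character.MODIFIER_LETTER:
            case Character.MODIFIER_SYMBOL:
            case Character.NON_SPACING_MARK:
            case Character.OTHER_LETTER:
            case Character.OTHER_NUMBER:
            case Character.OTHER_PUNCTUATION:
            case Character.PARAGRAPH_SEPARATOR:
            case Character.SPACE_SEPARATOR:
            case Character.START_PUNCTUATION:
            case Character.TITLECASE_LETTER:
            case Character.UPPERCASE_LETTER:
                return true;
            default:
                return false;
        }
    }

    // Hypothetical helper: rebuilds the string from the kept code points only.
    static String stripSymbols(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(EmojiFilter::isKept)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // \u2708 is the airplane sign, \uD83D\uDD25 is the fire emoji.
        System.out.println(stripSymbols("Flight \u2708 to \u6771\u4EAC \uD83D\uDD25"));
    }
}
```

Two caveats with this sketch: because CONTROL is not in the list, line breaks and tabs are removed as well, so you may want to add that category if your strings are multi-line. Also, the variation selector U+FE0F (NON_SPACING_MARK) and the zero-width joiner U+200D (FORMAT) are in kept categories, so invisible leftovers from some emoji sequences can remain in the output.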
