Remove ✅,


I have some strings with all kinds of different emojis/images/signs in them.

Not all the strings are in English -- some of them are in other non-Latin languages, for


7 Answers

被撕碎了的回忆

2020-11-28 21:02
I'm not super into Java, so I won't try to write example code inline, but the way I would do this is to check what Unicode calls the "general category" of each character. There are several letter and punctuation categories, along with categories for marks, numbers, symbols, and separators.

You can use Character.getType to find the general category of a given character. You should probably retain those characters that fall in these general categories:
```
COMBINING_SPACING_MARK
CONNECTOR_PUNCTUATION
CURRENCY_SYMBOL
DASH_PUNCTUATION
DECIMAL_DIGIT_NUMBER
ENCLOSING_MARK
END_PUNCTUATION
FINAL_QUOTE_PUNCTUATION
FORMAT
INITIAL_QUOTE_PUNCTUATION
LETTER_NUMBER
LINE_SEPARATOR
LOWERCASE_LETTER
MATH_SYMBOL
MODIFIER_LETTER
MODIFIER_SYMBOL
NON_SPACING_MARK
OTHER_LETTER
OTHER_NUMBER
OTHER_PUNCTUATION
PARAGRAPH_SEPARATOR
SPACE_SEPARATOR
START_PUNCTUATION
TITLECASE_LETTER
UPPERCASE_LETTER
```
(All of the characters you listed as specifically wanting to remove have general category OTHER_SYMBOL, which I did not include in the above category whitelist.)
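
For readers who want something concrete, the whitelist approach above could look roughly like the following. This is only a sketch under some assumptions: the class name `SymbolStripper` and method name `stripSymbols` are made up for illustration, `Set.of` needs Java 9 or later, and the string is walked by code point rather than by `char` so that emoji outside the Basic Multilingual Plane are classified correctly.

```java
import java.util.Set;

// Hypothetical helper class; the name is illustrative, not from the original answer.
final class SymbolStripper {

    // General categories to retain, mirroring the whitelist above.
    // Character.getType(int) returns an int, so the byte constants are widened.
    private static final Set<Integer> KEEP = Set.of(
        (int) Character.COMBINING_SPACING_MARK,
        (int) Character.CONNECTOR_PUNCTUATION,
        (int) Character.CURRENCY_SYMBOL,
        (int) Character.DASH_PUNCTUATION,
        (int) Character.DECIMAL_DIGIT_NUMBER,
        (int) Character.ENCLOSING_MARK,
        (int) Character.END_PUNCTUATION,
        (int) Character.FINAL_QUOTE_PUNCTUATION,
        (int) Character.FORMAT,
        (int) Character.INITIAL_QUOTE_PUNCTUATION,
        (int) Character.LETTER_NUMBER,
        (int) Character.LINE_SEPARATOR,
        (int) Character.LOWERCASE_LETTER,
        (int) Character.MATH_SYMBOL,
        (int) Character.MODIFIER_LETTER,
        (int) Character.MODIFIER_SYMBOL,
        (int) Character.NON_SPACING_MARK,
        (int) Character.OTHER_LETTER,
        (int) Character.OTHER_NUMBER,
        (int) Character.OTHER_PUNCTUATION,
        (int) Character.PARAGRAPH_SEPARATOR,
        (int) Character.SPACE_SEPARATOR,
        (int) Character.START_PUNCTUATION,
        (int) Character.TITLECASE_LETTER,
        (int) Character.UPPERCASE_LETTER
    );

    // Returns a copy of input containing only code points whose
    // general category is in KEEP; everything else is dropped.
    static String stripSymbols(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> KEEP.contains(Character.getType(cp)))
             .forEachOrdered(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // U+2705 (the check mark from the question title) is OTHER_SYMBOL.
        System.out.println(stripSymbols("Remove \u2705, done")); // prints "Remove , done"
    }
}
```

One thing to watch with this exact list: tabs and line breaks such as `\t` and `\n` have general category CONTROL, which is not on the whitelist, so they would be stripped too. If your strings contain them, adding `Character.CONTROL` to the set (or special-casing whitespace) is probably what you want.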