I am using the CoreNLP Neural Network Dependency Parser to parse some social media content. Unfortunately, the file contains characters which are, according to fileformat.info,
You can remove the specific unwanted characters with:
document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010]", "");
If you find other unwanted characters, simply add them to the character class following the same scheme.
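As a minimal sketch of this blacklist approach (the class name, helper method, and sample input below are my own, for illustration):

```java
// Blacklist approach: strip a fixed set of known-bad characters
// before handing the text to the CoreNLP parser.
public class BlacklistFilter {

    // Removes the specific unwanted characters listed above.
    static String clean(String document) {
        return document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010]", "");
    }

    public static void main(String[] args) {
        // U+203C (double exclamation mark) and U+3010 (left lenticular bracket)
        // are in the blacklist; U+3011 (the closing bracket) is not, so it survives.
        System.out.println(clean("Breaking\u203C \u3010news\u3011"));
    }
}
```

Note that this list has to grow every time a new unwanted character shows up in your data, which is what motivates the whitelist approach below.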
UPDATE:
Unicode characters are divided by the regex engine into 7 macro-groups (and several sub-groups), each identified by one letter (macro-group) or two letters (sub-group).
Based on your examples and the Unicode classes described in the always useful Regular Expressions Site, I think you can try a single keep-only-the-good-characters pass such as this:
document.replaceAll("[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]","")
This regex removes anything that is not:
\p{L}: a letter in any language
\p{N}: a number
\p{Z}: any kind of whitespace or invisible separator
\p{Sm}, \p{Sc}, \p{Sk}: math, currency, or modifier symbols as single characters
\p{Mc}*: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages)
\p{Pi}, \p{Pf}, \p{Pc}*: opening quotes, closing quotes, and word connectors (e.g. the underscore)

*: I think the starred groups could be eligible for removal as well for the purposes of CoreNLP.
This way you need only a single regex filter, and you can handle whole groups of characters with the same purpose instead of individual cases.
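A minimal sketch of this whitelist approach (again, the class name, helper method, and sample input are hypothetical):

```java
// Whitelist approach: keep letters, digits, separators, and the symbol and
// punctuation sub-groups listed above; drop every other character.
public class WhitelistFilter {

    static String clean(String document) {
        return document.replaceAll(
            "[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]",
            "");
    }

    public static void main(String[] args) {
        // ':' (Po) and the emoji U+1F600 (So) are dropped; '$' (Sc), '+' (Sm),
        // letters, digits, and spaces (Zs) are kept. Java's regex engine is
        // code-point aware, so the emoji's surrogate pair is removed as a unit.
        System.out.println(clean("Hello \uD83D\uDE00 world: $5 + tax"));
    }
}
```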