I am using the CoreNLP Neural Network Dependency Parser to parse some social media content. Unfortunately, the file contains characters which are, according to fileformat.info,
You can remove the specific unwanted characters with:
document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010]", "");
If you find other unwanted characters, simply add them to the character class following the same scheme.
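As a minimal sketch of this blacklist approach (the class name, helper method, and sample input below are my own, for illustration):

```java
// Blacklist approach: strip a fixed set of known-bad characters
// before handing the text to the CoreNLP parser.
public class BlacklistFilter {

    // Removes the specific unwanted characters listed above.
    static String clean(String document) {
        return document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010]", "");
    }

    public static void main(String[] args) {
        // U+203C (double exclamation mark) and U+3010 (left lenticular bracket)
        // are in the blacklist; U+3011 (the closing bracket) is not, so it survives.
        System.out.println(clean("Breaking\u203C \u3010news\u3011"));
    }
}
```

Note that this list has to grow every time a new unwanted character shows up in your data, which is what motivates the whitelist approach below.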
UPDATE:
Unicode characters are divided by the regex engine into 7 macro-groups (and several sub-groups), each identified by one letter (macro-group) or two letters (sub-group).
Based on your examples and the Unicode classes described in the always useful Regular Expressions Site, I think you can try a single keep-only-the-good-characters pass such as this:
document.replaceAll("[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]","")
This regex removes anything that is not:
\p{L}: a letter in any language
\p{N}: a number
\p{Z}: any kind of whitespace or invisible separator
\p{Sm}, \p{Sc}, \p{Sk}: math, currency, or modifier symbols as single characters
\p{Mc}*: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages)
\p{Pi}, \p{Pf}, \p{Pc}*: opening quotes, closing quotes, and word connectors (e.g. the underscore)

*: I think the starred groups could be eligible for removal as well for the purposes of CoreNLP.
This way you need only a single regex filter, and you can handle whole groups of characters with the same purpose instead of individual cases.
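A minimal sketch of this whitelist approach (again, the class name, helper method, and sample input are hypothetical):

```java
// Whitelist approach: keep letters, digits, separators, and the symbol and
// punctuation sub-groups listed above; drop every other character.
public class WhitelistFilter {

    static String clean(String document) {
        return document.replaceAll(
            "[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]",
            "");
    }

    public static void main(String[] args) {
        // ':' (Po) and the emoji U+1F600 (So) are dropped; '$' (Sc), '+' (Sm),
        // letters, digits, and spaces (Zs) are kept. Java's regex engine is
        // code-point aware, so the emoji's surrogate pair is removed as a unit.
        System.out.println(clean("Hello \uD83D\uDE00 world: $5 + tax"));
    }
}
```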