How to write a custom removePunctuation() function to better deal with Unicode chars?
In the source code of the tm text-mining R package, in the file transform.R, there is the removePunctuation() function, currently defined as:

    function(x, preserve_intra_word_dashes = FALSE) {
        if (!preserve_intra_word_dashes)
            gsub("[[:punct:]]+", "", x)
        else {
            # Assume there are no ASCII 1 characters.
            x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
            x <- gsub("[[:punct:]]+", "", x)
            gsub("\1", "-", x, fixed = TRUE)
        }
    }

I need to parse and mine some abstracts from a science conference (fetched from their website as UTF-8). The abstracts contain some Unicode characters that need to be removed, particularly at