How to write custom removePunctuation() function to better deal with Unicode chars?

非 Y 不嫁゛ submitted on 2019-11-30 14:21:46

As much as I like Susana's answer, it breaks the corpus in newer versions of tm: the documents are no longer PlainTextDocument objects and their metadata is destroyed.

You will get a list and the following error:

Error in UseMethod("meta", x) : 
no applicable method for 'meta' applied to an object of class "character"

Using

tm_map(your_corpus, PlainTextDocument)

will give you back your corpus, but with broken $meta (in particular, document ids will be missing).

Solution

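Use content_transformer, which wraps a function so that it changes only the document content and leaves the metadata intact: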

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
your_corpus <- tm_map(your_corpus, toSpace, "„")
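
As a rough sketch of how the same idea could stand in for a Unicode-aware removePunctuation(): the example below wraps a gsub() call that uses the PCRE property class \p{P} (any Unicode punctuation character) in content_transformer. The name removeUnicodePunct and the two sample documents are made up for illustration, and the text is assumed to be UTF-8 encoded.

```r
library(tm)

# Hypothetical helper: replace every Unicode punctuation character with a space.
# \p{P} requires perl = TRUE; the input is assumed to be UTF-8.
removeUnicodePunct <- content_transformer(function(x) {
  gsub("\\p{P}", " ", x, perl = TRUE)
})

# Illustrative corpus containing typographic quotes and an en dash
docs <- c("\u201eHallo\u201c, Welt!", "caf\u00e9 \u2013 bar")
your_corpus <- VCorpus(VectorSource(docs))

your_corpus <- tm_map(your_corpus, removeUnicodePunct)
your_corpus <- tm_map(your_corpus, stripWhitespace)

# The documents should keep their class and metadata
class(your_corpus[[1]])
meta(your_corpus[[1]], "id")
content(your_corpus[[1]])
```

Note that \p{P} does not cover symbols such as $, + or ^, which belong to the \p{S} categories; if those should go as well, a class like [\p{P}\p{S}] may be closer to what the ASCII [[:punct:]] class removes.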

Source: Hands-On Data Science with R, Text Mining, Graham.Williams@togaware.com http://onepager.togaware.com/

Update

This function removes everything that is not alphanumeric or whitespace (e.g. UTF-8 emoticons):

removeNonAlnum <- function(x){
  # Only the leading "^" negates the class; a second "^" would be kept as a literal caret
  gsub("[^[:alnum:][:space:]]","",x)
}
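
A quick base-R check of how such a negated bracket expression behaves may help here. Only a caret in the first position negates the class; a caret anywhere later is treated as a literal character that the class matches. The function names below are invented for the comparison.

```r
# Hypothetical name: keeps letters, digits and whitespace only
keepAlnumSpace <- function(x) gsub("[^[:alnum:][:space:]]", "", x)

# Same pattern with a second caret: "^" becomes part of the kept set
keepAlnumSpaceCaret <- function(x) gsub("[^[:alnum:]^[:space:]]", "", x)

x <- "a^b c! 100%"
keepAlnumSpace(x)       # "ab c 100"
keepAlnumSpaceCaret(x)  # "a^b c 100"

# Non-ASCII punctuation such as typographic quotes goes through the same class
keepAlnumSpace("\u201eHallo\u201c")
```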

I had the same problem: my custom function was not working. It turned out that the first line below (the UseMethod generic) has to be added.

Regards

Susana

# S3 generic; without it the methods below are never dispatched
replaceExpressions <- function(x) UseMethod("replaceExpressions", x)

# One method body shared by PlainTextDocument and character inputs
replaceExpressions.PlainTextDocument <- replaceExpressions.character <- function(x) {
    # fixed = TRUE treats each pattern as a literal string, not a regex
    x <- gsub(".", " ", x, fixed = TRUE)
    x <- gsub(",", " ", x, fixed = TRUE)
    x <- gsub(":", " ", x, fixed = TRUE)
    return(x)
}

notes_pre_clean <- tm_map(notes, replaceExpressions, useMeta = FALSE)
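
For comparison, the three fixed-string substitutions could be collapsed into a single call with a bracket expression. This is only a minimal base-R sketch under the made-up name replaceExpressionsOnce, and it operates on a plain character vector rather than on a corpus.

```r
# Hypothetical one-call variant: inside [], ".", "," and ":" are literal characters
replaceExpressionsOnce <- function(x) gsub("[.,:]", " ", x)

replaceExpressionsOnce("Dr. Smith, age: 42")
# "Dr  Smith  age  42"
```

In newer tm versions a function like this would presumably be wrapped in content_transformer before being handed to tm_map, as the other answer describes.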