I am doing a lot of analysis with the TM package. One of my biggest problems are related to stemming and stemming-like transformations.
Let\'s say I hav
We could set up a list of synonyms and replace those values. For example
synonyms <- list(
list(word="account", syns=c("acount", "accounnt"))
)
This says we want to replace "acount" and "accounnt" with "account" (i'm assuming we're doing this after stemming). Now let's create test data.
raw<-c("accounts", "account", "accounting", "acounting",
"acount", "acounts", "accounnt")
And now let's define a transformation function that will replace the words in our list with the primary synonym.
library(tm)
replaceSynonyms <- content_transformer(function(x, syn=NULL) {
Reduce(function(a,b) {
gsub(paste0("\\b(", paste(b$syns, collapse="|"),")\\b"), b$word, a)}, syn, x)
})
Here we use the content_transformer function to define a custom transformation. And basically we just do a gsub to replace each of the words. We can then use this on a corpus
tm <- Corpus(VectorSource(raw))
tm <- tm_map(tm, stemDocument)
tm <- tm_map(tm, replaceSynonyms, synonyms)
inspect(tm)
and we can see all these values are transformed into "account" as desired. To add other synonyms, just add additional lists to the main synonyms list. Each sub-list should have the names "word" and "syns".