Removing non-English text from Corpus in R using tm()

Posted by 会有一股神秘感。 on 2019-12-18 11:32:39

Question


I am using the tm and wordcloud packages for some basic data mining in R, but I am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables).

Let's say that some of the lines in my TXT file (saved as UTF-8 in TextWrangler) look like this:

Special
satisfação
Happy
Sad
Potential für

I then read my txt file into R:

words <- Corpus(DirSource("~/temp", encoding = "UTF-8"),readerControl = list(language = "lat"))

This yields the warning message:

Warning message:
In readLines(y, encoding = x$Encoding) :
  incomplete final line found on '/temp/file.txt'

But since it's a warning, not an error, I continue to push forward.

words <- tm_map(words, stripWhitespace)
words <- tm_map(words, tolower)

This then yields the error:

Error in FUN(X[[1L]], ...) : invalid input 'satisfa��o' in 'utf8towcs'

I'm open to finding ways to filter out the non-English characters either in TextWrangler or R; whatever is the most expedient. Thanks for your help!


Answer 1:


Here's a method to remove words with non-ASCII characters before making a corpus:

# remove words with non-ASCII characters
# assuming you read your txt file in as a vector, e.g.
# dat <- readLines('~/temp/dat.txt')
dat <- "Special,  satisfação, Happy, Sad, Potential, für"
# convert string to vector of words
dat2 <- unlist(strsplit(dat, split=", "))
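# flag words with non-ASCII characters: iconv() replaces every byte it
# cannot convert to ASCII with the marker "dat2", which grepl() then detects
dat3 <- grepl("dat2", iconv(dat2, "latin1", "ASCII", sub="dat2"))
# subset original vector of words to exclude words with non-ASCII char
# (logical indexing still works when no word is flagged; dat2[-integer(0)] would return nothing)
dat4 <- dat2[!dat3]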
# convert vector back to a string
dat5 <- paste(dat4, collapse = ", ")
# make corpus
require(tm)
words1 <- Corpus(VectorSource(dat5))
inspect(words1)

A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
Special, Happy, Sad, Potential
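
The same filtering idea can be sketched in base R without a marker string, by checking the code points of each word directly. This is only an illustration, not part of the original answer: the helper names is_ascii_word and keep_ascii_words are hypothetical, and the input is assumed to be UTF-8 (as the question states for its file).

```r
# Hypothetical helper (not from the original answer): TRUE when a single
# string contains only ASCII characters (code points below 128).
is_ascii_word <- function(word) {
  codes <- utf8ToInt(enc2utf8(word))
  # utf8ToInt() returns NA for invalid UTF-8, so treat that as non-ASCII
  !anyNA(codes) && all(codes < 128L)
}

# Hypothetical helper: keep only the elements that are pure ASCII.
keep_ascii_words <- function(words) {
  keep <- vapply(words, is_ascii_word, logical(1), USE.NAMES = FALSE)
  words[keep]
}

# Same example string as above, written with \u escapes so the script
# itself stays ASCII-only
dat <- "Special,  satisfa\u00e7\u00e3o, Happy, Sad, Potential, f\u00fcr"
# split on a comma followed by any amount of whitespace
dat_words <- strsplit(dat, ",\\s*")[[1]]
cleaned <- paste(keep_ascii_words(dat_words), collapse = ", ")
print(cleaned)
```

A vector of lines returned by readLines() could be passed to keep_ascii_words() in the same way before building the corpus with Corpus(VectorSource(...)).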



Answer 2:


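You can also use the package "stringi", which transliterates accented characters to their closest ASCII equivalents instead of dropping the affected words.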

Using the above example:

library(stringi)
dat <- "Special,  satisfação, Happy, Sad, Potential, für"
stringi::stri_trans_general(dat, "latin-ascii")

Output:

[1] "Special,  satisfacao, Happy, Sad, Potential, fur"  
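
If neither dropping whole words nor transliterating is wanted, a third option is to delete only the non-ASCII characters themselves. The following is a minimal base R sketch under that assumption; it is not taken from either answer, and the variable name stripped is just illustrative.

```r
# Same example string as above, written with \u escapes so the script
# itself stays ASCII-only
dat <- "Special,  satisfa\u00e7\u00e3o, Happy, Sad, Potential, f\u00fcr"

# Remove every character outside the ASCII range 0x01-0x7F;
# perl = TRUE enables the \x hexadecimal escapes in the character class
stripped <- gsub("[^\\x01-\\x7F]", "", dat, perl = TRUE)
print(stripped)
# [1] "Special,  satisfao, Happy, Sad, Potential, fr"
```

Note that this leaves word fragments such as "satisfao" and "fr" behind, so it tends to suit cases where the stray characters are noise rather than parts of real words.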


Source: https://stackoverflow.com/questions/18153504/removing-non-english-text-from-corpus-in-r-using-tm
