Question
I'm having a pretty stubborn issue: I can't seem to remove the <U+F0B7> and <U+F0A0> strings from corpora that I have loaded from *.txt files into R.
UPDATE: Here's a link to the sample .txt file: https://db.tt/qTRKpJYK
Corpus(DirSource("./SomeDirectory/txt/"), readerControl = list(reader = readPlain))
title
professional staff - contract - permanent position
software c microfocus cobol unix btrieve ibm vm-cms vsam cics jcl
accomplishments
<U+F0B7>
<U+F0A0>
responsible maintaining billing system interfaced cellular switching system <U+F0B7>
<U+F0A0>
developed unix interface ibm mainframe ericsson motorola att cellular switches
I've tried adding it to:
badWords <- unique(c(stopwords("en"),
stopwords("SMART")[stopwords("SMART") != "c"],
as.character(1970:2050),
"<U+F0B7>", "<u+f0b7>",
"<U+F0A0>", "<u+f0a0>",
"january", "jan",
"february", "feb",
"march", "mar",
"april", "apr",
"may", "may",
"june", "jun",
"july", "jul",
"august", "aug",
"september", "sep",
"october", "oct",
"november", "nov",
"december", "dec"))
And using:
tm_map(candidates.Corpus, removeWords, badWords)
But that doesn't work somehow. I've also tried to regex it out with something like gsub("<U+F0A0>", "", tmp, perl = FALSE), and that works on a string within R, but somehow these characters still show up when I read a .txt file.
Is there something unique about these characters? How do I get rid of them?
Answer 1:
OK. The problem is that your data has an unusual Unicode character in it. In R, we typically escape this character as "\uf0b7", but when inspect()
prints its data, it encodes it as "<U+F0B7>". Observe:
sample<-c("Crazy \uf0b7 Character")
cp<-Corpus(VectorSource(sample))
inspect(DocumentTermMatrix(cp))
# A document-term matrix (1 documents, 3 terms)
#
# Non-/sparse entries: 3/0
# Sparsity : 0%
# Maximal term length: 9
# Weighting : term frequency (tf)
#
# Terms
# Docs <U+F0B7> character crazy
# 1 1 1 1
(Actually, I had to create this output on a Windows machine running R 3.0.2 - it worked fine on my Mac running R 3.1.0.)
Unfortunately, you will not be able to remove this with removeWords, because the regular expression used in that function requires that word boundaries appear on both sides of the "word", and this doesn't seem to be a character that is recognized as adjacent to a boundary. See:
gsub("\uf0b7", "", sample)
# [1] "Crazy  Character"
gsub("\\b\uf0b7\\b", "", sample)
# [1] "Crazy <U+F0B7> Character"   (the boundary version does not remove it)
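The reason the boundary version fails: \b only matches at a transition between a word character (\w) and a non-word character, and \uf0b7 is a private-use-area codepoint, which is not a word character, so there is no boundary next to it for \b to anchor on. A quick check (my own illustration, not part of the original answer):

```r
# \uf0b7 sits in Unicode's private-use area, so it is not a word
# character; \b therefore never matches adjacent to it.
grepl("\\w", "\uf0b7", perl = TRUE)  # FALSE
grepl("\\w", "a", perl = TRUE)       # TRUE
```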
So we can write our own function to use with tm_map(). Consider:
removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}
which is basically the removeWords function, just without the boundary conditions. Then we can run:
cp2 <- tm_map(cp, removeCharacters, c("\uf0b7","\uf0a0"))
inspect(DocumentTermMatrix(cp2))
# A document-term matrix (1 documents, 2 terms)
#
# Non-/sparse entries: 2/0
# Sparsity : 0%
# Maximal term length: 9
# Weighting : term frequency (tf)
#
# Terms
# Docs character crazy
# 1 1 1
and we see those unicode characters are no longer there.
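One caveat to add (this part is my addition, not from the original answer): in tm 0.6 and later, tm_map expects custom transformations to be wrapped in content_transformer(), otherwise you may get errors about documents not being TextDocuments. A sketch under that assumption, which also shows a blunter alternative of stripping the whole private-use range:

```r
library(tm)

removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}

cp <- Corpus(VectorSource("Crazy \uf0b7\uf0a0 Character"))

# tm >= 0.6: wrap a plain function in content_transformer()
cp2 <- tm_map(cp, content_transformer(removeCharacters), c("\uf0b7", "\uf0a0"))

# Or drop every private-use-area codepoint (U+E000..U+F8FF) in one pass
stripPUA <- function(x) gsub("[\uE000-\uF8FF]", "", x, perl = TRUE)
cp3 <- tm_map(cp, content_transformer(stripPUA))
```

The range-based version is useful because documents exported from word processors often carry several different private-use bullet glyphs, and it saves listing them one by one.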
Source: https://stackoverflow.com/questions/24147816/remove-unicode-f0b7-from-corpus-text