Remove unicode <+f0b7> from Corpus text

Question

I'm having a pretty stubborn issue... I can't seem to remove the <+f0b7> and <+f0a0> strings from corpora that have been loaded from *.txt files into R:

UPDATE Here's a link to the sample .txt file: https://db.tt/qTRKpJYK

Corpus(DirSource("./SomeDirectory/txt/"), readerControl = list(reader = readPlain))

title
 professional staff - contract - permanent position
software c microfocus cobol unix btrieve ibm vm-cms vsam cics jcl
accomplishments
 <+f0b7>
<+f0a0>
responsible maintaining billing system interfaced cellular switching system <+f0b7>
<+f0a0>
developed unix interface ibm mainframe ericsson motorola att cellular switches

I've tried adding it to:

badWords <- unique(c(stopwords("en"), 
          stopwords("SMART")[stopwords("SMART") != "c"],
          as.character(1970:2050),
          "<U+F0B7>", "<+f0b7>",
          "<U+F0A0>", "<+f0a0>",
          "january",  "jan",
          "february",   "feb",
          "march",  "mar",
          "april",  "apr",
          "may",    "may",
          "june",   "jun",
          "july",   "jul",
          "august", "aug",
          "september",  "sep",
          "october",    "oct",
          "november",   "nov",
          "december",   "dec"))

And using:

tm_map(candidates.Corpus, removeWords, badWords)

But somehow that doesn't work. I've also tried to strip it out with a regular expression, something like gsub("<+f0a0>", "", tmp, perl = FALSE), which works on a string typed within R, but these characters still show up when I read a .txt file.

Is there something unique about these characters? How do I get rid of them?

Answer 1:

OK. The problem is that your data contains an unusual Unicode character. In R, we typically escape this character as "\uf0b7". But when inspect() prints its data, it encodes it as "<U+F0B7>". Observe:

library(tm)

sample <- c("Crazy \uf0b7 Character")
cp <- Corpus(VectorSource(sample))
inspect(DocumentTermMatrix(cp))

# A document-term matrix (1 documents, 3 terms)
# 
# Non-/sparse entries: 3/0
# Sparsity           : 0%
# Maximal term length: 9 
# Weighting          : term frequency (tf)
# 
#     Terms
# Docs <U+F0B7> character crazy
#    1        1         1     1

(Actually, I had to create this output on a Windows machine running R 3.0.2; it worked fine on my Mac running R 3.1.0.)
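This also suggests why the gsub() attempt in the question behaved differently on a typed string and on file contents: the file most likely holds the single character U+F0A0, while "<+f0a0>" is only how that character ends up being displayed. A minimal base R sketch, where both example strings are made up for illustration:

```r
typed     <- "switching system <+f0a0>"   # literal text, as typed at the console
from_file <- "switching system \uf0a0"    # the actual single character U+F0A0

pattern <- "<+f0a0>"  # as a regex: one or more "<", followed by "f0a0>"

typed_clean <- gsub(pattern, "", typed)      # the literal label is removed
file_clean  <- gsub(pattern, "", from_file)  # nothing matches, string is unchanged

# The code point shows this is one character, not the printed label.
code_point <- utf8ToInt("\uf0a0")
```

So a pattern built from the printed label can only ever match text that literally contains that label.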

Unfortunately, you will not be able to remove this with removeWords, because the regular expression used in that function requires a word boundary on both sides of the "word", and this character does not seem to be recognized as one that forms a boundary. See:

gsub("\uf0b7","",sample)
# [1] "Crazy  Character"
gsub("\\b\uf0b7\\b","",sample)
# [1] "Crazy <U+F0B7> Character"
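To make the boundary issue concrete, here is a small base R sketch. The `bounded` pattern only imitates the shape of expression described above (an alternation wrapped in `\\b`); it is an assumption for illustration, not a copy of tm's source.

```r
x <- "Crazy \uf0b7 Character"
targets <- c("\uf0b7", "Crazy")

# Alternation wrapped in word boundaries, versus the same alternation without them.
bounded   <- sprintf("(*UCP)\\b(%s)\\b", paste(targets, collapse = "|"))
unbounded <- sprintf("(*UCP)(%s)", paste(targets, collapse = "|"))

with_boundaries    <- gsub(bounded, "", x, perl = TRUE)
without_boundaries <- gsub(unbounded, "", x, perl = TRUE)

# "Crazy" is an ordinary word, so \b matches around it and it is removed.
# The symbol sits between two spaces and is not a word character itself,
# so there is no boundary next to it and the bounded pattern leaves it alone.
symbol_survives_bounded   <- grepl("\uf0b7", with_boundaries, fixed = TRUE)
symbol_survives_unbounded <- grepl("\uf0b7", without_boundaries, fixed = TRUE)
```

The same word list therefore removes ordinary words but skips the symbol, which matches the behavior reported in the question.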

So we can write our own function to use with tm_map. Consider:

# Remove every occurrence of the given characters, with no word-boundary requirement
removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}

which is basically the removeWords function just without the boundary conditions. Then we can run

cp2 <- tm_map(cp, removeCharacters, c("\uf0b7","\uf0a0"))
inspect(DocumentTermMatrix(cp2))

# A document-term matrix (1 documents, 2 terms)
# 
# Non-/sparse entries: 2/0
# Sparsity           : 0%
# Maximal term length: 9 
# Weighting          : term frequency (tf)
# 
#     Terms
# Docs character crazy
#    1         1     1

and we see those unicode characters are no longer there.
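The helper can also be checked on a plain character vector before touching a corpus. The sketch below repeats the function so it runs on its own; the example documents are made up. It also assumes tm 0.6 or later, where custom functions passed to tm_map are usually wrapped in content_transformer(); that wrapper is not part of the answer above, so treat it as an adaptation rather than the original method.

```r
# Same helper as in the answer: an alternation with no \b anchors.
removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}

# Hypothetical sample documents containing the two problem characters.
docs <- c("Crazy \uf0b7 Character", "billing system \uf0a0 interface")
bad  <- c("\uf0b7", "\uf0a0")

# Works directly on a character vector, which is handy for a quick check.
cleaned <- removeCharacters(docs, bad)

# Assumption: tm >= 0.6 is installed; the block is skipped otherwise.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(docs))
  # Extra arguments after the function are passed through to removeCharacters.
  corpus_clean <- tm::tm_map(corpus, tm::content_transformer(removeCharacters), bad)
}
```

Note that removing the character leaves the surrounding spaces in place, so a whitespace cleanup step may still be wanted afterwards.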

Source: https://stackoverflow.com/questions/24147816/remove-unicode-f0b7-from-corpus-text
