I would like to convert HTML character entities like
&amp; to & o
While the solution by Jeroen does the job, it is not vectorised and is therefore slow when applied to a large number of strings. It only works on a character vector of length one, so a longer character vector has to be processed with sapply().
To demonstrate this, I first create a large character vector:
set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)
And apply the function:
unescape_html <- function(str) {
  # wrap the string in a dummy element so that read_html() parses it as markup
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}
system.time(res <- sapply(many_strings, unescape_html, USE.NAMES = FALSE))
## user system elapsed
## 2.327 0.000 2.326
head(res)
## [1] "& ' >" "€ <" "& ' >" "€ <" "€ <" "abcd"
It is much faster if all the strings in the character vector are combined into a single, large string, such that read_html() and xml_text() need only be used once. The strings can then easily be separated again using strsplit():
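unescape_html2 <- function(str) {
  # join all strings with a separator and wrap them in a single dummy element
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  # parse and unescape the combined string once
  parsed <- xml2::xml_text(xml2::read_html(html))
  # fixed = TRUE: treat the separator literally, not as a regular expression
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}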
system.time(res2 <- unescape_html2(many_strings))
## user system elapsed
## 0.011 0.000 0.010
identical(res, res2)
## [1] TRUE
Of course, you need to make sure that the string used to join the elements of str ("#_|" in my example) does not appear anywhere in str. Otherwise, the combined string will be split in the wrong places at the end and the result will no longer line up with the input.
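One way to guard against this is to check for the separator before joining and to compare lengths after splitting. The following is only a sketch: check_separator() and unescape_html_checked() are hypothetical names of my own, not part of xml2, and the wrapper element <x> is assumed to be the same dummy element as in the functions above.

```r
# Hypothetical helper (not part of xml2): stop early if the separator
# occurs literally in any input string.
check_separator <- function(str, sep = "#_|") {
  clash <- grepl(sep, str, fixed = TRUE)
  if (any(clash)) {
    stop("separator '", sep, "' occurs in ", sum(clash),
         " input string(s); choose a different one")
  }
  invisible(TRUE)
}

# Variant of unescape_html2() with the separator check and a length check.
unescape_html_checked <- function(str, sep = "#_|") {
  check_separator(str, sep)
  html <- paste0("<x>", paste0(str, collapse = sep), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  out <- strsplit(parsed, sep, fixed = TRUE)[[1]]
  # The lengths differ if, for example, the last input is an empty string
  # (strsplit() drops a trailing empty piece) or if an escaped separator
  # only appears after the entities have been decoded.
  if (length(out) != length(str)) {
    stop("got ", length(out), " strings back for ", length(str), " inputs")
  }
  out
}
```

With this, unescape_html_checked(many_strings) is meant to behave like unescape_html2(many_strings), while an input such as "a#_|b" stops with an error instead of silently returning a vector of the wrong length. The length check is a sanity test only; it will not catch every possible misalignment.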