How can I detect non-ASCII characters in a vector of strings in a grep-like fashion? For example, below I'd like to return c(1, 3)
or c(TRUE, FALSE, TRUE, FALSE).
Why don't you extract the relevant code from showNonASCII?
x <- c("façile test of showNonASCII(): details{",
"This is a good line", "This has an ümlaut in it.", "OK again. }")
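grepNonASCII <- function(x) {
  # Conversion to ASCII gives NA (or altered text) when a string has non-ASCII characters
  asc <- iconv(x, "latin1", "ASCII")
  ind <- is.na(asc) | asc != x
  which(ind)  # positions of the offending elements
}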
grepNonASCII(x)
#[1] 1 3
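Since the question asks for either positions or a logical vector, the same check can return both. This is only a minimal sketch: the helper name `nonASCII` and its `logical` argument are hypothetical (they are not part of `tools`), and the example strings are written with `\u` escapes so that the script itself stays ASCII.

```r
# Hypothetical helper: same iconv() check as above, with a switch for the
# return type. The name and the `logical` argument are illustrative only.
nonASCII <- function(x, logical = FALSE) {
  asc <- iconv(x, "latin1", "ASCII")  # NA where conversion to ASCII fails
  ind <- is.na(asc) | asc != x        # TRUE for elements with non-ASCII bytes
  if (logical) ind else which(ind)
}

x <- c("fa\u00e7ile test of showNonASCII(): details{",
       "This is a good line", "This has an \u00fcmlaut in it.", "OK again. }")

nonASCII(x)                  # positions, grep()-style
nonASCII(x, logical = TRUE)  # flags, grepl()-style
```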
A bit late, I guess, but this could be useful for future readers.
You can find these functions:
showNonASCII(<character_vector>)
showNonASCIIfile(<file>)
in the tools R package (see https://stat.ethz.ch/R-manual/R-devel/library/tools/html/showNonASCII.html). They do exactly what is asked here: show non-ASCII characters in a string or in a text file.
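A short usage sketch of both functions, assuming the example vector from the answer above (written here with `\u` escapes) and a temporary file created only for illustration. Each call prints the offending elements prefixed by their positions, with non-ASCII bytes shown in `<xx>` form.

```r
x <- c("fa\u00e7ile test of showNonASCII(): details{",
       "This is a good line", "This has an \u00fcmlaut in it.", "OK again. }")

# Report the elements of a character vector that contain non-ASCII characters
tools::showNonASCII(x)

# Same check on a file; the temporary file exists only for this example
tmp <- tempfile(fileext = ".txt")
writeLines(x, tmp, useBytes = TRUE)  # write the bytes as they are
tools::showNonASCIIfile(tmp)
unlink(tmp)
```

Note that these functions are meant for reporting; to get indices or a logical vector for further processing, one of the other approaches in this thread is more direct.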
Another possible way is to convert your strings to ASCII and then detect the non-printable control characters that are generated for anything that couldn't be converted:
grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))
## [1] TRUE FALSE TRUE FALSE
That said, stringi also seems to have a built-in function for this kind of thing:
stringi::stri_enc_mark(x)
# [1] "latin1" "ASCII" "latin1" "ASCII"
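To turn this into the index or logical form the question asks for, one option is `stringi::stri_enc_isascii()`, or a comparison of the `stri_enc_mark()` result against "ASCII". A minimal sketch, assuming stringi is installed and using `\u` escapes for the example strings (so `stri_enc_mark()` may report "UTF-8" rather than "latin1" here):

```r
# Requires the stringi package: install.packages("stringi")
x <- c("fa\u00e7ile test of showNonASCII(): details{",
       "This is a good line", "This has an \u00fcmlaut in it.", "OK again. }")

ascii <- stringi::stri_enc_isascii(x)  # TRUE where the string is pure ASCII

which(!ascii)  # positions of non-ASCII elements
!ascii         # logical flags

# Equivalent check based on the encoding marks
stringi::stri_enc_mark(x) != "ASCII"
```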
I came across this later; it uses a pure base-R regex and is very simple:
grepl("[^ -~]", x)
## [1] TRUE FALSE TRUE FALSE
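The class `[^ -~]` matches any character outside the range from space to tilde, that is, outside printable ASCII. Swapping `grepl()` for `grep()` gives the index form. A small sketch with the same example vector (written with `\u` escapes); note that R's documentation warns that character ranges can be locale-dependent, so treat this as a convenience rather than a guarantee:

```r
x <- c("fa\u00e7ile test of showNonASCII(): details{",
       "This is a good line", "This has an \u00fcmlaut in it.", "OK again. }")

grep("[^ -~]", x)   # positions instead of flags

# Control characters are ASCII but fall outside the space..tilde range,
# so they are flagged as well
grepl("[^ -~]", "a\tb")
```

If tabs or newlines should count as acceptable, they would need to be added to the class explicitly.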
More here: http://www.catonmat.net/blog/my-favorite-regex/