Detect non-ASCII characters in a string

暗喜 2020-12-09 08:24

How can I detect non-ASCII characters in a vector of strings in a grep-like fashion? For example, below I'd like to return c(1, 3) or c(TRUE, FALSE, TRUE,

4 Answers
  • 2020-12-09 09:05

    Why don't you extract the relevant code from showNonASCII?

    x <- c("façile test of showNonASCII(): details{", 
           "This is a good line", "This has an ümlaut in it.", "OK again. }")
    
    grepNonASCII <- function(x) {
      asc <- iconv(x, "latin1", "ASCII")
      ind <- is.na(asc) | asc != x
      which(ind)
    }
    
    grepNonASCII(x)
    #[1] 1 3
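
    The question also accepts a logical vector as the result. A minimal sketch of that variant follows; `hasNonASCII` is a hypothetical name (not part of tools), and the sample strings use `\u` escapes so the script does not depend on the file's own encoding:

    ```r
    x <- c("fa\u00e7ile test of showNonASCII(): details{",
           "This is a good line", "This has an \u00fcmlaut in it.", "OK again. }")

    # hasNonASCII is a hypothetical helper name, not a function from tools
    hasNonASCII <- function(x) {
      # Conversion to ASCII fails (NA) or alters the string when
      # any non-ASCII byte is present
      asc <- iconv(x, "latin1", "ASCII")
      is.na(asc) | asc != x
    }

    hasNonASCII(x)         # logical vector, one element per string
    which(hasNonASCII(x))  # the corresponding indices
    ```

    Returning the logical vector keeps the helper usable for subsetting, for example `x[!hasNonASCII(x)]` to keep only the clean strings.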
    
  • 2020-12-09 09:05

    A bit late, I guess, but this could be useful for future readers.

    You can find these functions:

    • showNonASCII(<character_vector>)
    • showNonASCIIfile(<file>)

    in the tools R package (see https://stat.ethz.ch/R-manual/R-devel/library/tools/html/showNonASCII.html). They do exactly what is asked here: show non-ASCII characters in a string or in a text file.
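
    A short usage sketch for the character-vector variant. The sample vector is illustrative, and the assumption that the invisible return value holds the offending elements should be checked against the help page for your R version:

    ```r
    x <- c("fa\u00e7ile test of showNonASCII(): details{",
           "This is a good line", "This has an \u00fcmlaut in it.", "OK again. }")

    # Prints the elements that contain non-ASCII characters,
    # with the non-ASCII bytes shown as escapes
    bad <- tools::showNonASCII(x)

    # Assumption: the (invisible) return value holds the offending elements
    length(bad)
    ```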

  • 2020-12-09 09:17

    Another possible way is to convert your strings to ASCII and then detect the non-printable control characters that are generated in place of the characters that couldn't be converted:

    grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))
    ## [1]  TRUE FALSE  TRUE FALSE
    

    That said, stringi seems to have a built-in function for this kind of thing too:

    stringi::stri_enc_mark(x)
    # [1] "latin1" "ASCII"  "latin1" "ASCII" 
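
    If only a yes/no answer per string is needed, stringi also provides `stri_enc_isascii()`. A minimal sketch, assuming the stringi package is installed (the variable name `nonascii` is just illustrative):

    ```r
    x <- c("fa\u00e7ile test of showNonASCII(): details{",
           "This is a good line", "This has an \u00fcmlaut in it.", "OK again. }")

    # stri_enc_isascii() is TRUE for strings made of ASCII bytes only,
    # so negate it to flag the strings with non-ASCII content
    nonascii <- !stringi::stri_enc_isascii(x)

    nonascii          # logical vector
    which(nonascii)   # indices of the flagged strings
    ```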
    
  • 2020-12-09 09:17

    I came across this later; it uses nothing but a base R regex and is very simple:

    grepl("[^ -~]", x)
    ## [1]  TRUE FALSE  TRUE FALSE
    

    More here: http://www.catonmat.net/blog/my-favorite-regex/
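
    One caveat with the regex above: the class `[^ -~]` matches anything outside the range from space to tilde, so tabs, newlines and other ASCII control characters are flagged too. If those should count as ASCII, a byte-level check is one option. The sketch below is an illustration only; `hasHighByte` is a hypothetical helper name:

    ```r
    x <- c("fa\u00e7ile test", "This is a good line",
           "tab\tseparated", "an \u00fcmlaut")

    # hasHighByte is a hypothetical helper: TRUE when any byte is above 0x7F,
    # which holds for non-ASCII text in both latin1 and UTF-8
    hasHighByte <- function(x) {
      vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
             logical(1), USE.NAMES = FALSE)
    }

    grepl("[^ -~]", x)  # the tab string is flagged here as well
    hasHighByte(x)      # only the strings with non-ASCII bytes are flagged
    ```

    Note that `charToRaw()` turns a missing value into the bytes of the text "NA", so handle `NA` elements separately if they can occur.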
