Removing non-ASCII characters from data files

喜夏-厌秋 提交于 2019-11-26 06:28:57

问题


I\'ve got a bunch of csv files that I\'m reading into R and including in a package/data folder in .rdata format. Unfortunately the non-ASCII characters in the data fail the check. The tools package has two functions to check for non-ASCII characters (showNonASCII and showNonASCIIfile) but I can\'t seem to locate one to remove/clean them.

Before I explore other UNIX tools, it would be great to do this all in R so I can maintain a complete workflow from raw data to final product. Are there any existing packages/functions to help me get rid of the non-ASCII characters?


回答1:


To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:

x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1"  # (just to make sure)
x
# [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm"        "Jreskog"       "bichen Zrcher"

To locate non-ASCII characters, or to find if there were any at all in your files, you could likely adapt the following ideas:

## Do *any* lines contain non-ASCII characters? 
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
[1] TRUE

## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
[1] 1 2 3



回答2:


These days, a slightly better approach is to use the stringi package which provides a function for general unicode conversion. This allows you to preserve the original text as much as possible:

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
x
#> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

stringi::stri_trans_general(x, "latin-ascii")
#> [1] "Ekstrom"          "Joreskog"         "bisschen Zurcher"



回答3:


To remove all words with non-ascii characters (borrowing code from @Hadley), you can use the package xfun with filter from dplyr

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher", "alex")
x

x %>% 
  tibble(name = .) %>%
  filter(xfun::is_ascii(name)== T)


来源:https://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!