Dealing with Byte Order Mark (BOM) in R [duplicate]

Submitted by 早过忘川 on 2021-01-27 07:42:28

Question


Sometimes a Byte Order Mark (BOM) is present at the beginning of a .csv file. The symbol is not visible when you open the file in Notepad or Excel; however, when you read the file in R using various methods, you will see different symbols in the name of the first column. Here is an example.

A sample CSV file with a BOM at the beginning:

ID,raw_title,semi_clean,semi_clean_id
1,0 - 0,,0
2,"""0 - 1,000,000""",,0
27448,"20yr. rope walker
igger",Rope Walker Igger,1832700817
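To reproduce the problem without an editor that adds a BOM, a similar file can be written from R itself. The following is a minimal sketch: the file name bom_demo.csv and the shortened two-row body are assumptions made for illustration, not the original file1.csv.

```r
# Write a small CSV that starts with the UTF-8 BOM bytes EF BB BF.
# "bom_demo.csv" is a hypothetical file name used only in this sketch.
path <- file.path(tempdir(), "bom_demo.csv")
bom  <- as.raw(c(0xEF, 0xBB, 0xBF))
body <- charToRaw("ID,raw_title\n1,0 - 0\n2,abc\n")
writeBin(c(bom, body), path)

# Read back the first three bytes to confirm the BOM is really there.
first3  <- readBin(path, what = "raw", n = 3L)
has_bom <- identical(first3, bom)
print(first3)
print(has_bom)
```

The same readBin check can be pointed at any existing file to find out whether it carries a BOM before choosing how to read it.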

Reading with read.csv in base R

(x1 = read.csv("file1.csv",stringsAsFactors = FALSE))
#   ï..ID                raw_title        semi_clean semi_clean_id
# 1     1                    0 - 0                               0
# 2     2          "0 - 1,000,000"                               0
# 3 27448 20yr. rope walker\nigger Rope Walker Igger    1832700817

Reading with fread in the data.table package

(x2 = data.table::fread("file1.csv"))
#    ï»¿ID                raw_title        semi_clean semi_clean_id
# 1:     1                    0 - 0                               0
# 2:     2        ""0 - 1,000,000""                               0
# 3: 27448 20yr. rope walker\rigger Rope Walker Igger    1832700817

Reading with read_csv in the readr package

(x3 = readr::read_csv("file1.csv"))
#   <U+FEFF>ID                raw_title        semi_clean semi_clean_id
# 1          1                    0 - 0              <NA>             0
# 2          2          "0 - 1,000,000"              <NA>             0
# 3      27448 20yr. rope walker\rigger Rope Walker Igger    1832700817

Notice the different characters in front of the variable name ID.

Here are the results when you run names() on each of them:

names(x1)
# [1] "ï..ID"         "raw_title"     "semi_clean"    "semi_clean_id"
names(x2)
# [1] "ï»¿ID"         "raw_title"     "semi_clean"    "semi_clean_id"
names(x3)
# [1] "ID"             "raw_title"     "semi_clean"    "semi_clean_id"

In x3, there is nothing 'visible' in front of ID, but when you check

names(x3)[[1]]=="ID"
# [1] FALSE
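The comparison fails because the name still starts with the invisible character U+FEFF. A small sketch of how to make it visible, using a hand-built string nm as a stand-in for names(x3)[[1]] (an assumption, so that the snippet runs without the file):

```r
# Stand-in for names(x3)[[1]]: "ID" preceded by the BOM character U+FEFF.
nm <- "\uFEFFID"

nm == "ID"      # FALSE
nchar(nm)       # 3, although only two characters are visible
utf8ToInt(nm)   # 65279 73 68; 65279 is U+FEFF
```

The extra code point 65279 in front of 73 ("I") and 68 ("D") is the BOM.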

How can I get rid of these unwanted characters in each case? PS: Please add more methods for reading CSV files, the problems encountered, and the solutions.


Answer 1:


For read.csv in base R use:

x1 = read.csv("file1.csv",stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")

For fread, use:

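x2 = data.table::fread("file1.csv")
# rename the first column by position; its BOM-prefixed name may print as "ï»¿ID"
data.table::setnames(x2, 1L, "ID")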

For read_csv, use:

x3 = readr::read_csv("file1.csv")
data.table::setDT(x3) # convert to a data.table so that setnames can be used
data.table::setnames(x3, "\uFEFFID", "ID")

One non-R solution is to open the file in Notepad++, change the encoding to UTF-8 without BOM from the Encoding menu, and save the file.
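If the reader function cannot be changed, the first column name can also be cleaned after the fact. The helper below is a sketch, not part of any package: strip_bom_names is a hypothetical name, and the pattern covers only the three spellings that appear in this question (the U+FEFF character, the mis-decoded "ï»¿", and the "ï.." produced by read.csv).

```r
# Hypothetical helper: remove a leading BOM (in the forms seen above)
# from the first column name of a data frame.
strip_bom_names <- function(df) {
  # \uFEFF            : the BOM as a single Unicode character
  # \u00EF\u00BB\u00BF: "ï»¿", the BOM bytes decoded as Latin-1
  # \u00EF\\.\\.      : "ï..", the form shown in the read.csv output
  pattern <- "^(\uFEFF|\u00EF\u00BB\u00BF|\u00EF\\.\\.)"
  names(df)[1] <- sub(pattern, "", names(df)[1])
  df
}

# Usage on a hand-built data frame with a BOM-prefixed first name
demo <- data.frame(a = 1:2, b = c("x", "y"))
names(demo) <- c("\uFEFFID", "raw_title")
demo <- strip_bom_names(demo)
print(names(demo)[1] == "ID")
```

For a data.table, renaming in place with setnames as shown above avoids the copy that assigning to names() would make.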



Source: https://stackoverflow.com/questions/39593637/dealing-with-byte-order-mark-bom-in-r
