Here's the link of my data.
My target is to assign "NA" to all blank cells irrespective of categorical or numerical values. I am using na.strings="". But it's not assigning NA to all blank cells.
## reading the data
dat <- read.csv("data2.csv")
head(dat)
mon hr acc alc sex spd axles door reg cond1 drug1
1 8 21 No Control TRUE F 0 2 2 Physical Impairment (Eyes, Ear, Limb) A
2 7 20 No Control FALSE M 900 2 2 Inattentive D
3 3 9 No Control FALSE F 100 2 2 2004 Normal D
4 1 15 No Control FALSE M 0 2 2 Physical Impairment (Eyes, Ear, Limb) D
5 4 21 No Control FALSE 25 NA NA D
6 4 20 No Control NA F 30 2 4 Drinking Alcohol - Impaired D
inj1 PED_STATE st rac1
1 Fatal <NA> F <NA>
2 Moderate <NA> F <NA>
3 Moderate <NA> M <NA>
4 Complaint <NA> M <NA>
5 Complaint <NA> F <NA>
6 Moderate <NA> M <NA>
## using na.strings
dat2 <- read.csv("data2.csv", header=T, na.strings="")
head(dat2)
mon hr acc alc sex spd axles door reg cond1 drug1
1 8 21 No Control TRUE F 0 2 2 <NA> Physical Impairment (Eyes, Ear, Limb) A
2 7 20 No Control FALSE M 900 2 2 <NA> Inattentive D
3 3 9 No Control FALSE F 100 2 2 2004 Normal D
4 1 15 No Control FALSE M 0 2 2 <NA> Physical Impairment (Eyes, Ear, Limb) D
5 4 21 No Control FALSE 25 NA NA <NA> <NA> D
6 4 20 No Control NA F 30 2 4 <NA> Drinking Alcohol - Impaired D
inj1 PED_STATE st rac1
1 Fatal NA F NA
2 Moderate NA F NA
3 Moderate NA M NA
4 Complaint NA M NA
5 Complaint NA F NA
6 Moderate NA M NA
I'm assuming you are talking about row 5 column "sex." It could be the case that in the data2.csv file, the cell contains a space and hence is not considered empty by R.
Also, I noticed that in row 5 columns "axles" and "door", the original values read from data2.csv are string "NA". You probably want to treat those as na.strings as well. To do this,
dat2 <- read.csv("data2.csv", header=T, na.strings=c("","NA"))
EDIT:
I downloaded your data2.csv. Yes, there is a space in row 5 column "sex". So you want
na.strings=c(""," ","NA")
You can use gsub to replace multiple mutations of empty, like "" or a space, to be NA:
data= data.frame(cats=c('', ' ', 'meow'), dogs=c("woof", " ", NA))
apply(data, 2, function(x) gsub("^$|^ $", NA, x))
A more eye-friendly solution using dplyr
would be
require(dplyr)
## fake blank cells
iris[1,1]=""
## define a helper function
empty_as_na <- function(x){
if("factor" %in% class(x)) x <- as.character(x) ## since ifelse wont work with factors
ifelse(as.character(x)!="", x, NA)
}
## transform all columns
iris %>% mutate_each(funs(empty_as_na))
To apply the correction to just a subset of columns you can specify columns of interest using dplyr's column matching syntax. Example:mutate_each(funs(empty_as_na), matches("Width"), Species)
In case you table contains dates you should consider using a more typesafe version of ifelse
I recently ran into similar issues. This is what worked for me, if the variable is numeric, then a simple df$Var[df$Var == ""] <- "NA"
should suffice. But if the variable is a factor, then you need to convert it to character first, then replace ""
cells with the value you want, and convert it back to factor. So case in point, your Sex variable, I assume it would be a factor and if you want to replace the empty cell, I would do the following:
df$Var <- as.character(df$Var)
df$Var[df$Var==""] <- "NA"
df$Var <- as.factor(df$Var)
This should do the trick
dat <- dat %>% mutate_all(na_if,"")
My function takes into account factor, character vector and potential attributes, if you use haven or foreign package to read external files. Also it allows matching different self-defined na.strings. To transform all columns, simply use lappy: df[] = lapply(df, blank2na, na.strings=c('','NA','na','N/A','n/a','NaN','nan'))
See more the comments:
#' Replaces blank-ish elements of a factor or character vector to NA
#' @description Replaces blank-ish elements of a factor or character vector to NA
#' @param x a vector of factor or character or any type
#' @param na.strings case sensitive strings that will be coverted to NA. The function will do a trimws(x,'both') before conversion. If NULL, do only trimws, no conversion to NA.
#' @return Returns a vector trimws (always for factor, character) and NA converted (if matching na.strings). Attributes will also be kept ('label','labels', 'value.labels').
#' @seealso \code{\link{ez.nan2na}}
#' @export
blank2na = function(x,na.strings=c('','.','NA','na','N/A','n/a','NaN','nan')) {
if (is.factor(x)) {
lab = attr(x, 'label', exact = T)
labs1 <- attr(x, 'labels', exact = T)
labs2 <- attr(x, 'value.labels', exact = T)
# trimws will convert factor to character
x = trimws(x,'both')
if (! is.null(lab)) lab = trimws(lab,'both')
if (! is.null(labs1)) labs1 = trimws(labs1,'both')
if (! is.null(labs2)) labs2 = trimws(labs2,'both')
if (!is.null(na.strings)) {
# convert to NA
x[x %in% na.strings] = NA
# also remember to remove na.strings from value labels
labs1 = labs1[! labs1 %in% na.strings]
labs2 = labs2[! labs2 %in% na.strings]
}
# the levels will be reset here
x = factor(x)
if (! is.null(lab)) attr(x, 'label') <- lab
if (! is.null(labs1)) attr(x, 'labels') <- labs1
if (! is.null(labs2)) attr(x, 'value.labels') <- labs2
} else if (is.character(x)) {
lab = attr(x, 'label', exact = T)
labs1 <- attr(x, 'labels', exact = T)
labs2 <- attr(x, 'value.labels', exact = T)
# trimws will convert factor to character
x = trimws(x,'both')
if (! is.null(lab)) lab = trimws(lab,'both')
if (! is.null(labs1)) labs1 = trimws(labs1,'both')
if (! is.null(labs2)) labs2 = trimws(labs2,'both')
if (!is.null(na.strings)) {
# convert to NA
x[x %in% na.strings] = NA
# also remember to remove na.strings from value labels
labs1 = labs1[! labs1 %in% na.strings]
labs2 = labs2[! labs2 %in% na.strings]
}
if (! is.null(lab)) attr(x, 'label') <- lab
if (! is.null(labs1)) attr(x, 'labels') <- labs1
if (! is.null(labs2)) attr(x, 'value.labels') <- labs2
} else {
x = x
}
return(x)
}
While many options above function well, I found coercion of non-target variables to chr
problematic. Using ifelse
and grepl
within lapply
resolves this off-target effect (in limited testing). Using slarky's regular expression in grepl
:
set.seed(42)
x1 <- sample(c("a","b"," ", "a a", NA), 10, TRUE)
x2 <- sample(c(rnorm(length(x1),0, 1), NA), length(x1), TRUE)
df <- data.frame(x1, x2, stringsAsFactors = FALSE)
The problem of coercion to character class:
df2 <- lapply(df, function(x) gsub("^$|^ $", NA, x))
lapply(df2, class)
$x1
[1] "character"
$x2 [1] "character"
Resolution with use of ifelse:
df3 <- lapply(df, function(x) ifelse(grepl("^$|^ $", x)==TRUE, NA, x))
lapply(df3, class)
$x1
[1] "character"
$x2 [1] "numeric"
You can also use mutate_at
in dplyr
dat <- dat %>%
mutate_at(vars(colnames(.)),
.funs = funs(ifelse(.=="", NA, as.character(.))))
Select individual columns to change:
dat <- dat %>%
mutate_at(vars(colnames(.)[names(.) %in% c("Age","Gender")]),
.funs = funs(ifelse(.=="", NA, as.character(.))))
As of (dplyr 0.8.0 above) the way this should be written has changed. Before it was, funs()
in .funs (funs(name = f(.))
. Instead of funs
, now we use list (list(name = ~f(.)))
Note that there is also a much simpler way to list the column names ! (both the name of the column and column index work)
dat <- dat %>%
mutate_at(.vars = c("Age","Gender"),
.funs = list(~ifelse(.=="", NA, as.character(.))))
Couldn't you just use
dat <- read.csv("data2.csv",na.strings=" ",header=TRUE)
should convert all blanks to NA as the data are read in be sure to put a space between your quotation
Call dplyr
package by installing from cran
in r
library(dplyr)
(file)$(colname)<-sub("-",NA,file$colname)
It will convert all the blank cell in a particular column as NA
If the column contains "-", "", 0 like this change it in code according to the type of blank cell
E.g. if I get a blank cell like "" instead of "-", then use this code:
(file)$(colname)<-sub("", NA, file$colname)
来源:https://stackoverflow.com/questions/24172111/change-the-blank-cells-to-na