How do I remove question marks (?) from a data set in R?

Question

Hello everyone, I am analysing the UCI adult census data. The data uses a question mark (?) for every missing value.

I want to replace all the question marks with NA.

I tried:

library(XML)
census<-read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",header=F,na.strings="?")
names(census)<-c("Age","Workclass","Fnlwght","Education","EducationNum","MaritalStatus","Occupation"   
  ,"Relationship" , "Race","Gender","CapitalGain","CapitalLoss","HoursPerWeek","NativeCountry","Salary"  )

table(census$Workclass)

                ?       Federal-gov         Local-gov      Never-worked           Private      Self-emp-inc 
             1836               960              2093                 7             22696              1116 
 Self-emp-not-inc         State-gov       Without-pay 
             2541              1298                14 

x <- ifelse(census$Workclass=="?",NA,census$Workclass)
table(x)
x
    1     2     3     4     5     6     7     8     9 
 1836   960  2093     7 22696  1116  2541  1298    14

But it did not work.

Please help.

Answer 1:

Look at gsub:

census$x <- gsub("?",NA,census$x, fixed = TRUE)

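Edit: I forgot to add fixed = TRUE, which makes gsub treat "?" as a literal character rather than as a regular-expression quantifier.

As Richard pointed out, this will catch every occurrence of a ?, so any value that contains a question mark anywhere is affected, not only values that consist of the question mark alone.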
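The caveat above can be seen on a small vector. This is only a sketch: the vector workclass and its values are made up for illustration and are not taken from the census file. It relies on the documented behaviour that gsub returns NA for every matching element when the replacement is NA.

```r
# Toy vector standing in for a census column (values are hypothetical)
workclass <- c(" Private", " ?", " State-gov", " Why?")

# fixed = TRUE treats "?" as a literal character, not a regex quantifier.
# With NA as the replacement, every element that matches becomes NA.
cleaned <- gsub("?", NA, workclass, fixed = TRUE)
print(cleaned)

# An exact comparison after trimming whitespace only touches the marker itself
exact <- workclass
exact[trimws(exact) == "?"] <- NA
print(exact)
```

In the first result the element " Why?" is also turned into NA, which is the over-matching Richard warned about; the exact comparison leaves it alone.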

Answer 2:

Here's an easy way to replace " ?" with NA in all columns.

# find elements
idx <- census == " ?"
# replace elements with NA
is.na(census) <- idx

How does it work?

The command idx <- census == " ?" creates a logical matrix with the same number of rows and columns as the data frame census. This matrix idx contains TRUE where census contains " ?" and FALSE at all other positions.

The matrix idx is used as an index. The command is.na(census) <- idx is used to replace values in census at the positions in idx with NA.
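The two steps can be tried on a small data frame. This is a minimal sketch: the data frame toy and its values are hypothetical stand-ins for census, chosen so the result is easy to check by eye.

```r
# Toy data frame standing in for census (values are hypothetical)
toy <- data.frame(
  Workclass  = c(" Private", " ?", " State-gov"),
  Occupation = c(" ?", " Sales", " ?"),
  Age        = c(39, 50, 38),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE wherever a cell equals " ?" (note the leading space)
idx <- toy == " ?"
print(idx)

# Replacement form of is.na: set the flagged cells to NA
is.na(toy) <- idx
print(toy)

# Query form of is.na: count the missing values per column
print(colSums(is.na(toy)))
```

The numeric Age column is compared as well, but none of its cells match, so it is left unchanged.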

Note that the replacement function is.na<- is used here. It is not the same as the is.na function, which only tests for missing values.
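Since the marker in this file is " ?" with a leading space, it may also be possible to handle it when the file is read, which would explain why na.strings="?" alone had no effect in the question. The sketch below assumes that the fields are separated by a comma followed by a space; the three rows in raw are made up for illustration and are not taken from the real file.

```r
# A few lines mimicking the assumed layout of adult.data (hypothetical rows)
raw <- "39, State-gov, Bachelors\n50, ?, HS-grad\n38, Private, ?"

# strip.white = TRUE trims the whitespace around each field,
# so the bare "?" can match na.strings
toy <- read.csv(text = raw, header = FALSE,
                na.strings = "?", strip.white = TRUE)
print(toy)

# Count the missing values per column
print(colSums(is.na(toy)))
```

A side effect of strip.white = TRUE is that the other values lose their leading space too, so " Private" is read as "Private".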

Source: https://stackoverflow.com/questions/28061122/how-do-i-remove-question-mark-from-a-data-set-in-r
