Using if else on a dataframe across multiple columns

Submitted by 半世苍凉 on 2019-12-30 11:01:13

Question


I have a large dataset of samples with descriptors of whether the sample is viable - it looks (kind of) like this, where 'desc' is the description column and 'blank' indicates the sample is not viable:

     desc        x        y        z
1   blank 4.529976 5.297952 5.581013
2   blank 5.906855 4.557389 4.901660
3  sample 4.322014 4.798248 4.995959
4  sample 3.997565 5.975604 7.160871
5   blank 4.898922 7.666193 5.551385
6   blank 5.667884 5.195825 5.232072
7   blank 5.524773 6.726074 4.767475
8  sample 4.382937 5.926217 5.203737
9  sample 4.976908 3.079191 4.614121
10  blank 4.572954 4.772373 6.077195

I want to use an if else statement to set the rows with unusable data to NA. The final data set should look like this:

     desc        x        y        z
1   blank       NA       NA       NA
2   blank       NA       NA       NA
3  sample 4.322014 4.798248 4.995959
4  sample 3.997565 5.975604 7.160871
5   blank       NA       NA       NA
6   blank       NA       NA       NA
7   blank       NA       NA       NA
8  sample 4.382937 5.926217 5.203737
9  sample 4.976908 3.079191 4.614121
10  blank       NA       NA       NA 

I have tried a for loop, but I'm having trouble getting the for-loop to change all the columns in one loop. My real dataset has 40 columns, so I'd rather not have to process it in separate loops! Here is the code to change one column at a time:

for(i in 1:length(desc)){
    if(dat$desc[i] =="blank"){
    dat$x[i] <- NA
    } 
    else {
    dat$x[i] <- dat$x[i]
    }
}

I made the sample data with this script:

desc <- c("blank", "blank", "sample", "sample", "blank", "blank", "blank",    "sample", "sample", "blank")
x <-  rnorm(10, mean=5, sd=1)
y <-  rnorm(10, mean=5, sd=1)
z <-  rnorm(10, mean=5, sd=1)

dat <- data.frame(desc,x,y,z)

Sorry if this is a basic question, I've spent all morning looking at forums and haven't been able to find a solution.

Any help is much appreciated!


Answer 1:


For your example dataset, this will work:

Option 1, name the columns to change:

dat[which(dat$desc == "blank"), c("x", "y", "z")] <- NA

In your actual data with 40 columns, if you just want to set the last 39 columns to NA, then the following may be simpler than naming each of the columns to change:

Option 2, select columns using a range:

dat[which(dat$desc == "blank"), 2:40] <- NA

Option 3, exclude the 1st column:

dat[which(dat$desc == "blank"), -1] <- NA

Option 4, exclude a named column:

dat[which(dat$desc == "blank"), !names(dat) %in% "desc"] <- NA

As you can see, there are many ways to do this kind of operation (this is far from a complete list), and understanding how each of these options works will help you to get a better understanding of the language.
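To see that the four options address the same cells, and what which() contributes, here is a minimal sketch on the sample data from the question. The seed, the object names (opt1 to opt4, with_na, bare_index) and the NA placed in desc are assumptions added for illustration; they are not part of the original answer.

```r
set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible
dat <- data.frame(
  desc = c("blank", "blank", "sample", "sample", "blank",
           "blank", "blank", "sample", "sample", "blank"),
  x = rnorm(10, mean = 5, sd = 1),
  y = rnorm(10, mean = 5, sd = 1),
  z = rnorm(10, mean = 5, sd = 1)
)

blank_rows <- which(dat$desc == "blank")  # integer positions of the blank rows

# Apply each option to its own copy of the data
opt1 <- dat; opt1[blank_rows, c("x", "y", "z")] <- NA
opt2 <- dat; opt2[blank_rows, 2:ncol(dat)] <- NA
opt3 <- dat; opt3[blank_rows, -1] <- NA
opt4 <- dat; opt4[blank_rows, !names(dat) %in% "desc"] <- NA

# TRUE if all four column selections produced the same data frame
same_result <- identical(opt1, opt2) &&
  identical(opt1, opt3) &&
  identical(opt1, opt4)

# which() drops NA comparisons, so a missing description is simply skipped
with_na <- dat
with_na$desc[3] <- NA  # hypothetical missing description
with_na[which(with_na$desc == "blank"), -1] <- NA

# Without which(), the NA in the logical row index makes the assignment fail
bare_index <- tryCatch({
  tmp <- dat
  tmp$desc[3] <- NA
  tmp[tmp$desc == "blank", -1] <- NA
  "no error"
}, error = function(e) "error")

print(same_result)
print(bare_index)
```

Using 2:ncol(dat) instead of 2:40 keeps the range option valid if the number of columns changes.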




Answer 2:


Sticking with your initial loop approach, I came up with this:

for (i in seq_len(nrow(dat))) {
  # Blank rows: set every column after the first (desc) to NA
  if (dat[i, 1] == "blank") {
    dat[i, 2:ncol(dat)] <- NA
  }
  # No else branch is needed: all other rows keep their values
}

I tested it with your data and it worked. I hope this is useful for anyone looping over rows and columns with conditions.
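If a loop is preferred, it may be simpler to loop over the columns rather than the rows, so that each pass blanks out one whole column at once. A minimal sketch, assuming the sample data from the question; the seed and the names blank_rows and value_cols are only illustrative:

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
dat <- data.frame(
  desc = c("blank", "blank", "sample", "sample", "blank",
           "blank", "blank", "sample", "sample", "blank"),
  x = rnorm(10, mean = 5, sd = 1),
  y = rnorm(10, mean = 5, sd = 1),
  z = rnorm(10, mean = 5, sd = 1)
)

blank_rows <- which(dat$desc == "blank")   # row positions to blank out
value_cols <- setdiff(names(dat), "desc")  # every column except desc

# One pass per column; each pass sets all blank rows of that column to NA
for (col in value_cols) {
  dat[[col]][blank_rows] <- NA
}

print(dat)
```

With 40 columns this runs 39 iterations regardless of how many rows the data has, whereas a row-wise loop runs once per row.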




Answer 3:


You can use dplyr and a custom function to mutate values on certain conditions.

library(dplyr)

# Apply mutate(...) only to the rows where `condition` is TRUE
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

# The data frame in the question is called dat
dat <- dat %>%
  mutate_cond(desc == "blank", x = NA, y = NA, z = NA)




Answer 4:


Here's another dplyr solution with a small custom function and mutate_each().

library(dplyr)

f <- function(x) if_else(dat$desc == "blank", NA_real_, x)
dat %>% 
  mutate_each(funs(f), -desc)
#>      desc        x        y        z
#> 1   blank       NA       NA       NA
#> 2   blank       NA       NA       NA
#> 3  sample 3.624941 6.430955 5.486632
#> 4  sample 3.236359 4.935453 4.319202
#> 5   blank       NA       NA       NA
#> 6   blank       NA       NA       NA
#> 7   blank       NA       NA       NA
#> 8  sample 5.058725 6.751650 4.750529
#> 9  sample 5.837206 4.323562 4.914780
#> 10  blank       NA       NA       NA
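
mutate_each() and funs() have since been deprecated in dplyr. A sketch of the same idea with across(), assuming dplyr 1.0.0 or later is installed; the seed and the name res are additions for illustration:

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
dat <- data.frame(
  desc = c("blank", "blank", "sample", "sample", "blank",
           "blank", "blank", "sample", "sample", "blank"),
  x = rnorm(10, mean = 5, sd = 1),
  y = rnorm(10, mean = 5, sd = 1),
  z = rnorm(10, mean = 5, sd = 1)
)

# For every column except desc, replace the value with NA on blank rows.
# NA_real_ is used because if_else() requires both branches to share a type.
res <- dat %>%
  mutate(across(-desc, ~ if_else(desc == "blank", NA_real_, .x)))

print(res)
```

Because the lambda refers to desc inside mutate(), no helper that reaches back into dat$desc is needed.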



Answer 5:


Here is an option using set() from data.table. It should be faster because the overhead of [.data.table is avoided. We convert the data.frame to a data.table (setDT(df1)), loop through the column names of df1 (excluding the desc column), and assign NA to the rows where the logical condition in i is met. Here df1 stands for the question's dat.

library(data.table)
setDT(df1)
for(j in names(df1)[-1]){
   set(df1, i= which(df1[["desc"]]=="blank"), j= j, value= NA)
}
df1
#      desc        x        y        z
# 1:  blank       NA       NA       NA
# 2:  blank       NA       NA       NA
# 3: sample 4.322014 4.798248 4.995959
# 4: sample 3.997565 5.975604 7.160871
# 5:  blank       NA       NA       NA
# 6:  blank       NA       NA       NA
# 7:  blank       NA       NA       NA
# 8: sample 4.382937 5.926217 5.203737
# 9: sample 4.976908 3.079191 4.614121
#10:  blank       NA       NA       NA

Or another option (based on @dww's comment). Note that setting the key sorts the rows by desc:

setDT(df1, key = "desc")["blank", names(df1)[-1] := NA][]



Answer 6:


This should work, though the blank rows end up after the sample rows. Honestly, if the data is unusable, why not delete the rows altogether?

library(dplyr)

blanks = 
  dat %>%
  filter(desc == "blank") %>%
  select(desc)

dat %>%
  filter(desc == "sample") %>%
  bind_rows(blanks)
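
If deleting the unusable rows is acceptable, a base R sketch might look like the following. It assumes the sample data from the question; the seed and the name samples_only are only illustrative.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
dat <- data.frame(
  desc = c("blank", "blank", "sample", "sample", "blank",
           "blank", "blank", "sample", "sample", "blank"),
  x = rnorm(10, mean = 5, sd = 1),
  y = rnorm(10, mean = 5, sd = 1),
  z = rnorm(10, mean = 5, sd = 1)
)

# Keep only the rows that are not blank; which() skips any NA in desc
samples_only <- dat[which(dat$desc != "blank"), ]

print(samples_only)
```

The surviving rows keep their original row names, which makes it easy to trace them back to the full data set.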


Source: https://stackoverflow.com/questions/37313347/using-if-else-on-a-dataframe-across-multiple-columns
