Question
I'm trying to add a new column to a data frame based on several conditions from other columns. I have the following data:
> commute <- c("walk", "bike", "subway", "drive", "ferry", "walk", "bike", "subway", "drive", "ferry", "walk", "bike", "subway", "drive", "ferry")
> kids <- c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes")
> distance <- c(1, 12, 5, 25, 7, 2, "", 8, 19, 7, "", 4, 16, 12, 7)
>
> df = data.frame(commute, kids, distance)
> df
   commute kids distance
1     walk  Yes        1
2     bike  Yes       12
3   subway   No        5
4    drive   No       25
5    ferry  Yes        7
6     walk  Yes        2
7     bike   No
8   subway   No        8
9    drive  Yes       19
10   ferry  Yes        7
11    walk   No
12    bike   No        4
13  subway  Yes       16
14   drive   No       12
15   ferry  Yes        7
If the following three conditions are met:
commute = walk OR bike OR subway OR ferry
AND
kids = Yes
AND
distance is less than 10
Then I'd like a new column called get.flyer to equal "Yes". The final data frame should look like this:
   commute kids distance get.flyer
1     walk  Yes        1       Yes
2     bike  Yes       12       Yes
3   subway   No        5
4    drive   No       25
5    ferry  Yes        7       Yes
6     walk  Yes        2       Yes
7     bike   No
8   subway   No        8
9    drive  Yes       19
10   ferry  Yes        7       Yes
11    walk   No
12    bike   No        4
13  subway  Yes       16       Yes
14   drive   No       12
15   ferry  Yes        7       Yes
Answer 1:
We can use %in% to compare a column against multiple values, and & to check that all of the conditions are TRUE.
library(dplyr)
df %>%
  mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
                                   as.character(kids) == "Yes" &
                                   as.numeric(as.character(distance)) < 10) + 1])
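The `c("", "Yes")[... + 1]` idiom can look cryptic at first. Here is a minimal sketch of what it does, using a made-up logical vector `cond` (an assumption for illustration, not part of the original data). Adding 1 to a logical vector coerces FALSE to 1 and TRUE to 2, and those numbers then index into the two-element lookup vector.

```r
# Hypothetical condition vector, for illustration only
cond <- c(TRUE, FALSE, TRUE)

# Logical + 1 coerces to numeric: FALSE -> 1, TRUE -> 2
idx <- cond + 1

# Index into the lookup vector: position 1 is "", position 2 is "Yes"
flag <- c("", "Yes")[idx]
print(flag)
```

This is compact, but it relies on implicit coercion, which is why an explicit ifelse() is often easier to read.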
In R versions before 4.0.0, data.frame() defaults to stringsAsFactors=TRUE, so it is better to create the data.frame with stringsAsFactors=FALSE. If we check str(df) under that default, we find that all the columns are of class factor. Also, missing values should be written as NA rather than "", so that a numeric column is not converted to another class.
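To see why "" is a problem, here is a small sketch with made-up vectors (`with_blank`, `with_na` and `f` are illustrative names, not from the question). It shows the class change caused by "" and why a factor has to go through as.character() before as.numeric().

```r
# Mixing numbers with "" forces the whole vector to character
with_blank <- c(1, 12, "", 8)
print(class(with_blank))   # "character"

# Using NA keeps the vector numeric
with_na <- c(1, 12, NA, 8)
print(class(with_na))      # "numeric"

# A factor stores integer level codes, so convert via character first
f <- factor(c("1", "12", "5"))
print(as.numeric(f))                 # the level codes, not the values
print(as.numeric(as.character(f)))   # the actual numbers
```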
If we rewrite the creation of 'df'
distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
df1 <- data.frame(commute, kids, distance, stringsAsFactors=FALSE)
With that change, the code above can be simplified:
df1 %>%
  mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
                                   kids == "Yes" &
                                   distance < 10) + 1])
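One thing to watch once distance contains NA: a comparison with NA yields NA, and TRUE & NA is also NA, so a row can end up flagged as NA instead of "". The sketch below uses a hypothetical three-row data frame `demo` (not the original data) to show the effect and one possible way to treat a missing distance as "condition not met".

```r
# Hypothetical data; the second row has a missing distance
demo <- data.frame(commute = c("walk", "bike", "drive"),
                   kids = c("Yes", "Yes", "Yes"),
                   distance = c(3, NA, 4),
                   stringsAsFactors = FALSE)

eligible <- with(demo, commute %in% c("walk", "bike", "subway", "ferry") &
                       kids == "Yes" &
                       distance < 10)
print(eligible)   # the second element is NA because TRUE & NA is NA

# Indexing with NA propagates the NA into the result
print(c("", "Yes")[eligible + 1])

# Treat NA as FALSE before building the flag
demo$get.flyer <- c("", "Yes")[(!is.na(eligible) & eligible) + 1]
print(demo)
```

In the data from the question, both rows with a missing distance also have kids equal to "No", and FALSE & NA is FALSE, so the issue does not show up there.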
For readability, some people prefer ifelse:
df1 %>%
  mutate(get.flyer = ifelse(commute %in% c("walk", "bike", "subway", "ferry") &
                              kids == "Yes" &
                              distance < 10,
                            "Yes", ""))
This can also be done easily with base R:
df1$get.flyer <- with(df1, ifelse(commute %in% c("walk", "bike", "subway", "ferry") &
                                    kids == "Yes" &
                                    distance < 10,
                                  "Yes", ""))
Answer 2:
The solution has already been pointed out by @akrun. I'd like to present it in a more 'wrapped up' way.
You can use the ifelse function to create a column based on one or more conditions. First, though, you have to change how missing values are encoded in the distance column. You used "" to indicate a missing value, but this converts the entire column to character, so distance < 10 no longer compares numbers. The R way of indicating a missing value is NA, so your definition of distance should be:
distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
The ifelse statement then looks like this:
df$get.flyer <- ifelse(
  (
    (df$commute %in% c("walk", "bike", "subway", "ferry")) &
    (df$kids == "Yes") &
    (df$distance < 10)
  ),
  1,  # if condition is met, put 1
  0   # else put 0
)
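For reference, here is a self-contained sketch of the whole flow, from building the data frame with NA to creating the column. It reuses the data from the question; the `rep()` call is only a shorthand for the repeated commute values, and `flagged_rows` is an illustrative name.

```r
# Data from the question, with NA instead of "" for missing distances
commute <- rep(c("walk", "bike", "subway", "drive", "ferry"), times = 3)
kids <- c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No",
          "Yes", "Yes", "No", "No", "Yes", "No", "Yes")
distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
df <- data.frame(commute, kids, distance, stringsAsFactors = FALSE)

# 1 if all three conditions hold, 0 otherwise
df$get.flyer <- ifelse(
  df$commute %in% c("walk", "bike", "subway", "ferry") &
    df$kids == "Yes" &
    df$distance < 10,
  1, 0
)

# Row numbers that received the flag
flagged_rows <- which(df$get.flyer == 1)
print(flagged_rows)   # 1 5 6 10 15
```

Under the stated conditions, rows 2 and 13 are not flagged, because their distances (12 and 16) are not below 10, even though the expected output in the question marks them.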
Optional: Consider encoding your other columns in a different way as well:
- you could use TRUE and FALSE instead of "Yes" and "No" for the kids variable
- you could use a factor for commute
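A rough sketch of what those optional encodings might look like, using a shortened three-row version of the data (the names `kids_lgl`, `commute_f` and `eligible` are illustrative assumptions):

```r
# Shortened example data
kids <- c("Yes", "Yes", "No")
commute <- c("walk", "bike", "drive")

# Logical instead of "Yes"/"No"
kids_lgl <- kids == "Yes"

# Factor for the commute categories
commute_f <- factor(commute)

# The condition becomes shorter: no string comparison needed for kids,
# and %in% still works on a factor
eligible <- commute_f %in% c("walk", "bike", "subway", "ferry") & kids_lgl
print(eligible)
```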
Answer 3:
As an example, check whether the value of first_column_name is contained in second_column_name and write the result to new_column:
df$new_column <- apply(df, 1, function(x) grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE))
Details:
df$new_column <-   # create a new column named new_column on df
  apply(df, 1,     # apply the following function to df; `1` means row by row
    function(x)    # the function applied to each row; `x` is the current row
      grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE))  # test whether the value of first_column_name occurs in second_column_name; fixed = TRUE matches it as plain text rather than as a regular expression
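To try this out, here is a small sketch on a hypothetical data frame `demo` (the values are made up). It also shows a possible mapply() variant, which works on the columns directly instead of converting the data frame to a matrix the way apply() does.

```r
# Hypothetical data for illustration
demo <- data.frame(first_column_name = c("cat", "dog", "a.b"),
                   second_column_name = c("concatenate", "hotel", "axb"),
                   stringsAsFactors = FALSE)

# Row-wise check with apply(), as in the answer above
demo$new_column <- apply(demo, 1, function(x)
  grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE))

# Alternative sketch: pair up the two columns element by element
demo$new_column2 <- mapply(grepl,
                           demo$first_column_name,
                           demo$second_column_name,
                           MoreArgs = list(fixed = TRUE),
                           USE.NAMES = FALSE)
print(demo)
```

The third row illustrates fixed = TRUE: as a regular expression, "a.b" would match "axb", but as plain text it does not.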
Source: https://stackoverflow.com/questions/39405628/how-do-i-create-a-new-column-based-on-multiple-conditions-from-multiple-columns