Question
Here is an example of the data frame I'm working on:
id string
1 no
1 yes
1 yes
2 no
2 yes
3 yes
3 yes
3 no
I want to extract the ids for which the last two rows contain "yes" in the string column.
So the result would be:
id string
1 yes
1 yes
So I would end up with only one id, which is 1.
I tried to do this with a for loop, but since I have more than 200,000 rows, the loop takes too long: more than 5 minutes.
I tried this:
vec_id <- unique(df$id)
res <- c()  # ids whose last two rows are both "yes"
for (id in vec_id) {
  last_two <- tail(df[df$id == id, "string"], 2)
  if (length(last_two) == 2 && all(last_two == "yes")) {
    res <- append(res, id)
  }
}
Are there any functions or other ways to do this faster?
Answer 1:
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), then, grouped by 'id', if all of the last two 'string' observations are 'yes', return the last two 'string' values (using tail).
library(data.table)
setDT(df1)[, if(all(tail(string,2)=="yes")) .(string = tail(string,2)) , id]
# id string
#1: 1 yes
#2: 1 yes
NOTE: The general data.table syntax is data.table[i, j, by].
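If only the ids themselves are needed (as the question asks), a variation on the same grouped approach could look like the sketch below. It rebuilds the example data from the question; the names dt, keep and ids are illustrative, and the .N >= 2 check is an added assumption that an id with a single row should not count as a match.

```r
library(data.table)

# Example data from the question
dt <- data.table(
  id     = c(1, 1, 1, 2, 2, 3, 3, 3),
  string = c("no", "yes", "yes", "no", "yes", "yes", "yes", "no")
)

# One logical flag per id: TRUE when the group has at least two rows
# and its last two 'string' values are both "yes"
flags <- dt[, .(keep = .N >= 2L && all(tail(string, 2L) == "yes")), by = id]

# Keep only the ids whose flag is TRUE
ids <- flags[keep == TRUE, id]
ids
```

Using as.data.table() or data.table() instead of setDT() leaves the original data frame untouched, which may be preferable if it is reused elsewhere.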
Answer 2:
A base R alternative is to use split and lapply (with unlist) to construct a logical vector that can be used to perform the row subsetting:
dropper <- unlist(lapply(split(df$string, df$id),
FUN=function(i) c(rep(FALSE, length(i) - 2),
rep(all(tail(i, 2) =="yes"), 2))),
use.names=FALSE)
dropper
[1] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
Here, split splits df$string into a list by df$id, which lapply then feeds to an anonymous function. The function returns FALSE for the first n-2 elements and then either TRUE TRUE or FALSE FALSE for the final two elements, depending on whether they are both "yes".
Then use the vector to drop the unwanted observations:
df[dropper,]
id string
2 1 yes
3 1 yes
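To get just the matching ids in base R, rather than the rows, one option is tapply. This is only a sketch on the example data from the question; the names ok and ids are illustrative. The length check is an added assumption: it treats an id with fewer than two rows as a non-match, a case in which rep(FALSE, length(i) - 2) above would fail because of the negative count.

```r
# Example data from the question
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 3, 3, 3),
  string = c("no", "yes", "yes", "no", "yes", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# One logical value per id: are the last two 'string' values both "yes"?
ok <- tapply(df$string, df$id,
             function(s) length(s) >= 2 && all(tail(s, 2) == "yes"))

# tapply names its result by group, so the ids come back as character;
# convert them to numeric to match df$id
ids <- as.numeric(names(ok)[ok])
ids
```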
Source: https://stackoverflow.com/questions/42860953/extract-id-with-matching-pattern-on-several-rows-in-dataframe