Question
I have two vectors:
x<-c(0,1,0,2,3,0,1,1,0,2)
y<-c("00:01:00","00:02:00","00:03:00","00:04:00","00:05:00",
"00:06:00","00:07:00","00:08:00","00:09:00","00:10:00")
I need to select only those elements of y where the values of x are not interrupted by 0. As a result, I'd like to get a data frame like this:
y x
00:04:00 2
00:05:00 3
00:07:00 1
00:08:00 1
We built a script like this, but it is slow on a big dataset. Is there a more elegant solution? I also wonder why df<-rbind(bbb,df) returns df in inverted order.
aaa<-data.frame(y,x)
df<-NULL
for (i in 1:length(aaa$x)){
  bbb<-ifelse((aaa$x[i]*aaa$x[i+1])!=0,
              aaa$x[i],
              ifelse((aaa$x[i]*aaa$x[i-1])!=0,
                     aaa$x[i],
                     NA))
  df<-rbind(bbb,df)
}
df<-data.frame(rev(df))
aaa$x<-df$rev.df.
bbb<-na.omit(aaa)
bbb
I'm a newbie in R, so please, as much detail as you can :) Thank you!
Answer 1:
aaa <- data.frame(y,x)
rles <- rle(aaa$x == 0)
bbb <- aaa[rep(rles$values == FALSE & rles$lengths >= 2, rles$lengths),]
which gives
> bbb
y x
4 00:04:00 2
5 00:05:00 3
7 00:07:00 1
8 00:08:00 1
The sub-question you had: df<-rbind(bbb,df) returns df reversed because you are adding the new row (bbb) before the existing rows; swap the order of the arguments and you won't need to reverse df.
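To illustrate the effect of the argument order, here is a minimal sketch on a made-up three-element input. The names vals, prepended and appended are only for illustration and are not part of the original script.

```r
# Compare prepending vs appending rows with rbind() inside a loop.
vals <- c(10, 20, 30)

prepended <- NULL
appended <- NULL
for (v in vals) {
  prepended <- rbind(v, prepended)  # new row goes on top, so the order ends up reversed
  appended  <- rbind(appended, v)   # new row goes at the bottom, so the order is kept
}

as.vector(prepended)  # values in reverse order of insertion
as.vector(appended)   # values in order of insertion
```

Either way, growing an object with rbind inside a loop copies it on every iteration, which is a likely reason the loop is slow on a big dataset.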
Now to break down the answer, since it involves a lot of parts. First, rephrasing your criteria: you want stretches of aaa that don't have 0's for at least 2 rows. So the first criterion is finding the 0's:
> aaa$x == 0
[1] TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
Then you want to figure out the length of each of these stretches; rle does this.
> rle(aaa$x == 0)
Run Length Encoding
lengths: int [1:8] 1 1 1 2 1 2 1 1
values : logi [1:8] TRUE FALSE TRUE FALSE TRUE FALSE ...
This means there was 1 TRUE, then 1 FALSE, then 1 TRUE, then 2 FALSEs, etc. This result is assigned to rles. The parts you want are where the value is FALSE (not 0) and the length of that run is 2 or more.
> rles$values == FALSE & rles$lengths >= 2
[1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
This needs to be expanded back out to the length of aaa, and rep will do that, using rles$lengths to replicate the appropriate entries.
> rep(rles$values == FALSE & rles$lengths >= 2, rles$lengths)
[1] FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
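As a small sanity check on why rep works for this expansion, the sketch below shows that repeating each run's value by its run length rebuilds the original vector. The names is_zero, runs and rebuilt are chosen here for illustration; inverse.rle is the base R function that performs the same reconstruction.

```r
x <- c(0, 1, 0, 2, 3, 0, 1, 1, 0, 2)
is_zero <- x == 0
runs <- rle(is_zero)

# rep() with a vector of counts repeats each value by the matching count,
# so one entry per run becomes one entry per original element.
rebuilt <- rep(runs$values, runs$lengths)

identical(rebuilt, is_zero)            # TRUE
identical(inverse.rle(runs), is_zero)  # TRUE
```

Because any per-run vector can be expanded this way, the per-run keep/drop decision lines up element by element with the rows of aaa.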
The expanded vector is a logical vector suitable for indexing aaa:
> aaa[rep(rles$values == FALSE & rles$lengths >= 2, rles$lengths),]
y x
4 00:04:00 2
5 00:05:00 3
7 00:07:00 1
8 00:08:00 1
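If the same filter is needed in several places, the steps above could be wrapped in a small function. This is only a sketch: the function name keep_nonzero_runs and its col and min_len arguments are hypothetical, and it assumes the column contains no NA values.

```r
# Hypothetical helper: keep rows that belong to runs of non-zero values
# at least min_len long. Assumes df[[col]] is numeric without NAs.
keep_nonzero_runs <- function(df, col = "x", min_len = 2) {
  runs <- rle(df[[col]] == 0)                          # runs of zero / non-zero
  keep_run <- !runs$values & runs$lengths >= min_len   # one decision per run
  df[rep(keep_run, runs$lengths), , drop = FALSE]      # expand to one decision per row
}

x <- c(0, 1, 0, 2, 3, 0, 1, 1, 0, 2)
y <- c("00:01:00", "00:02:00", "00:03:00", "00:04:00", "00:05:00",
       "00:06:00", "00:07:00", "00:08:00", "00:09:00", "00:10:00")
aaa <- data.frame(y, x)

result <- keep_nonzero_runs(aaa)
result
```

With the default min_len = 2 this should select the same rows 4, 5, 7 and 8 as the one-liner above; a larger min_len would require longer uninterrupted stretches.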
Source: https://stackoverflow.com/questions/12790378/how-to-choose-non-interruped-numbers-only