问题
I am trying to find a way to determine when a set of columns changes value in a data.frame. Let me get straight to the point, please consider the following example:
x<-data.frame(cnt=1:10, code=rep('ELEMENT 1',10), val0=rep(5,10), val1=rep(6,10),val2=rep(3,10))
x[4,]$val0=6
- The cnt column is a unique ID (could be a date, or time column, for simplicity it's an int here)
- The code column is like an code for the set of rows (imagine several such groups but with different codes). The code and cnt are the keys in my data.table.
- The val0,val1,val2 columns are something like scores.
The data.frame above should be read as: The scores for 'ELEMENT 1' started as 5,6,3, remained as is until the 4 iteration when they changed to 6,6,3, and then changed back to 5,6,3.
My question, is there a way to get the 1st, 4th, and 5th row of the data.frame? Is there a way to detect when the columns change? (There are 12 columns btw)
I tried using the duplicated of data.table (which worked perfectly in the majority of the cases) but in this case it will remove all duplicates and leave rows 1 and 4 only (removing the 5th).
Do you have any suggestions? I would rather not use a for loop as there are approx. 2M lines.
回答1:
In data.table
version 1.8.10 (stable version in CRAN), there's a(n) (unexported) function called duplist
that does exactly this. And it's also written in C and is therefore terribly fast.
require(data.table) # 1.8.10
data.table:::duplist(x[, 3:5])
# [1] 1 4 5
If you're using the development version of data.table
(1.8.11), then there's a more efficient version (in terms of memory) renamed as uniqlist
, that does exactly the same job. Probably this should be exported for next release. Seems to have come up on SO more than once. Let's see.
require(data.table) # 1.8.11
data.table:::uniqlist(x[, 3:5])
# [1] 1 4 5
回答2:
Totally unreadable, but:
c(1,which(rowSums(sapply(x[,grep('val',names(x))],diff))!=0)+1)
# [1] 1 4 5
Basically, run diff
on each row, to find all the changes. If a change occurs in any column, then a change has occurred in the row.
Also, without the sapply
:
c(1,which(rowSums(diff(as.matrix(x[,grep('val',names(x))])))!=0)+1)
来源:https://stackoverflow.com/questions/21266015/determine-when-columns-of-a-data-frame-change-value-and-return-indices-of-the-ch