My question is related to assignment by reference versus copying in data.table. I want to know if one can delete rows by reference, similar to
Here are some strategies I have used. I believe a .ROW function may be coming. None of these approaches below are fast. These are some strategies a little beyond subsets or filtering. I tried to think like dba just trying to clean up data. As noted above, you can select or remove rows in data.table:
data(iris)
iris <- data.table(iris)
iris[3] # Select row three
iris[-3] # Remove row three
You can also use .SD to select or remove rows:
iris[,.SD[3]] # Select row three
iris[,.SD[3:6],by=,.(Species)] # Select row 3 - 6 for each Species
iris[,.SD[-3]] # Remove row three
iris[,.SD[-3:-6],by=,.(Species)] # Remove row 3 - 6 for each Species
Note: .SD creates a subset of the original data and allows you to do quite a bit of work in j or subsequent data.table. See https://stackoverflow.com/a/47406952/305675. Here I ordered my irises by Sepal Length, take a specified Sepal.Length as minimum,select the top three (by Sepal Length) of all Species and return all accompanying data:
iris[order(-Sepal.Length)][Sepal.Length > 3,.SD[1:3],by=,.(Species)]
The approaches above all reorder a data.table sequentially when removing rows. You can transpose a data.table and remove or replace the old rows which are now transposed columns. When using ':=NULL' to remove a transposed row, the subsequent column name is removed as well:
m_iris <- data.table(t(iris))[,V3:=NULL] # V3 column removed
d_iris <- data.table(t(iris))[,V3:=V2] # V3 column replaced with V2
When you transpose the data.frame back to a data.table, you may want to rename from the original data.table and restore class attributes in the case of deletion. Applying ":=NULL" to a now transposed data.table creates all character classes.
m_iris <- data.table(t(d_iris));
setnames(d_iris,names(iris))
d_iris <- data.table(t(m_iris));
setnames(m_iris,names(iris))
You may just want to remove duplicate rows which you can do with or without a Key:
d_iris[,Key:=paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)]
d_iris[!duplicated(Key),]
d_iris[!duplicated(paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)),]
It is also possible to add an incremental counter with '.I'. You can then search for duplicated keys or fields and remove them by removing the record with the counter. This is computationally expensive, but has some advantages since you can print the lines to be removed.
d_iris[,I:=.I,] # add a counter field
d_iris[,Key:=paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)]
for(i in d_iris[duplicated(Key),I]) {print(i)} # See lines with duplicated Key or Field
for(i in d_iris[duplicated(Key),I]) {d_iris <- d_iris[!I == i,]} # Remove lines with duplicated Key or any particular field.
You can also just fill a row with 0s or NAs and then use an i query to delete them:
X
x v foo
1: c 8 4
2: b 7 2
X[1] <- c(0)
X
x v foo
1: 0 0 0
2: b 7 2
X[2] <- c(NA)
X
x v foo
1: 0 0 0
2: NA NA NA
X <- X[x != 0,]
X <- X[!is.na(x),]