Question
(This is a follow-up to an earlier question.)
Consider this toy code:
> x <- data.frame(a = 1:2)
> foo <- function(z) { setDT(z) ; z[, b:=3:4] ; z }
> y <- foo(x)
>
> class(x)
[1] "data.table" "data.frame"
> x
   a
1: 1
2: 2
It looks like setDT did change the class of x, but the column b added inside the function did not show up in x.
What is happening here?
Answer 1:
Inside your function, z refers to the same object as x until setDT is called.
library(data.table)
foo <- function(z) {print(address(z)); setDT(z); print(address(z))}
x <- data.frame(a = 1:2)
address(x)
#[1] "0x555ec9a471e8"
foo(x)
#[1] "0x555ec9a471e8"
#[1] "0x555ec9ede300"
Inside setDT, execution reaches the following line while z still points to the same address as x:
setattr(z, "class", data.table:::.resetclass(z, "data.frame"))
setattr does not make a copy, so x and z still point to the same address and both now have the class data.table:
x <- data.frame(a = 1:2)
z <- x
class(x)
#[1] "data.frame"
address(x)
#[1] "0x555ec95de600"
address(z)
#[1] "0x555ec95de600"
setattr(z, "class", data.table:::.resetclass(z, "data.frame"))
class(x)
#[1] "data.table" "data.frame"
address(x)
#[1] "0x555ec95de600"
address(z)
#[1] "0x555ec95de600"
Then setalloccol is called, which in this case runs:
assign("z", .Call(data.table:::Calloccolwrapper, z, 1024, FALSE))
which now makes x and z point to different addresses:
address(x)
#[1] "0x555ecaa09c00"
address(z)
#[1] "0x555ec95de600"
Both still have the class data.table:
class(x)
#[1] "data.table" "data.frame"
class(z)
#[1] "data.table" "data.frame"
I think that if they had used
class(z) <- data.table:::.resetclass(z, "data.frame")
instead of
setattr(z, "class", data.table:::.resetclass(z, "data.frame"))
the problem would not occur.
x <- data.frame(a = 1:2)
z <- x
address(x)
#[1] "0x555ec9cd2228"
class(z) <- data.table:::.resetclass(z, "data.frame")
class(x)
#[1] "data.frame"
class(z)
#[1] "data.table" "data.frame"
address(x)
#[1] "0x555ec9cd2228"
address(z)
#[1] "0x555ec9cd65a8"
However, after class(z) <- value, z no longer points to the address it pointed to before (although the column z$a stays where it was):
z <- data.frame(a = 1:2)
address(z)
#[1] "0x5653dbe72b68"
address(z$a)
#[1] "0x5653db82e140"
class(z) <- c("data.table", "data.frame")
address(z)
#[1] "0x5653dbe82d98"
address(z$a)
#[1] "0x5653db82e140"
Then again, after setDT, z also no longer points to the address it pointed to before:
z <- data.frame(a = 1:2)
address(z)
#[1] "0x55b6f04d0db8"
setDT(z)
address(z)
#[1] "0x55b6efe1e0e0"
As @Matt-dowle pointed out, it is also possible to change the data in x through z:
x <- data.frame(a = c(1,3))
z <- x
setDT(z)
z[, b:=3:4]
z[2, a:=7]
z
# a b
#1: 1 3
#2: 7 4
x
# a
#1: 1
#2: 7
R.version.string
#[1] "R version 4.0.2 (2020-06-22)"
packageVersion("data.table")
#[1] ‘1.12.8’
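Given the behaviour described above, two patterns avoid the half-converted state. The following is only a minimal sketch: foo_copy and foo_ref are hypothetical helper names, not part of data.table. The first works on a copy, so the caller's data.frame is left alone. The second converts in the calling scope first, so that := inside the function really does modify the caller's object.

```r
library(data.table)

# Pattern 1 (hypothetical helper): work on a copy, leave the caller's object alone.
foo_copy <- function(z) {
  z <- as.data.table(z)   # returns a new data.table; the argument is not modified
  z[, b := 3:4]           # adds the column to the local data.table only
  z
}

x1 <- data.frame(a = 1:2)
y1 <- foo_copy(x1)
class(x1)   # x1 is still a plain data.frame
names(y1)   # y1 carries both columns

# Pattern 2 (hypothetical helper): convert in the calling scope, then modify by reference.
foo_ref <- function(z) {
  z[, b := 3:4]           # z is already a proper data.table, so this updates it in place
  invisible(z)
}

x2 <- data.frame(a = 1:2)
setDT(x2)     # rebinds x2 itself to the over-allocated data.table
foo_ref(x2)
names(x2)     # the new column is visible to the caller
```

Which pattern fits depends on whether the function is meant to have side effects on its argument. Calling setDT on a function argument gives neither behaviour cleanly, which is the situation in the question.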
Answer 2:
library(data.table)
x <- data.frame(a = 1:2)
y <- x #y is a reference to x
address(x)
#[1] "0x55e07e31a1e8"
address(y)
#[1] "0x55e07e31a1e8"
setDT(y) # Adds "data.table" to the class attribute of y AND x, then creates a copy, makes y point to it and turns y into a data.table
address(x)
#[1] "0x55e07e31a1e8"
address(y)
#[1] "0x55e07e7b1300"
class(x)
#[1] "data.table" "data.frame"
x[, b:=3:4]
#Warning message:
#In `[.data.table`(x, , `:=`(b, 3:4)) :
# Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
z <- data.frame(a = 1:2)
class(z) <- c("data.table", "data.frame")
z[, b:=3:4]
#Warning message:
#In `[.data.table`(z, , `:=`(b, 3:4)) :
# Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
Source: https://stackoverflow.com/questions/62775221/r-data-table-weird-value-reference-semantics