Question
This question seems to be partially answered here, but the answer is not specific enough for me. I would like to better understand when an object is updated by reference and when it is copied.
The simplest example is growing a vector. The following code is extremely inefficient in R because the memory is not allocated before the loop and a copy is made at each iteration.
x = runif(10)
y = c()
for(i in 2:length(x))
y = c(y, x[i] - x[i-1])
Allocating the memory up front reserves it once instead of reallocating it at each iteration. This code is thus drastically faster, especially with long vectors.
x = runif(10)
y = numeric(length(x))
for(i in 2:length(x))
y[i] = x[i] - x[i-1]
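To make the difference concrete, here is a minimal sketch that wraps both approaches in functions and times them with base R's `system.time`. The names `grow_diff` and `prealloc_diff` are made up for illustration and do not come from the original post. Timings depend on the machine and the R version, so none are shown. Note that the grown vector has length n - 1, whereas the preallocated one keeps a leading 0.

```r
# Hypothetical helper: grows the result one element at a time
grow_diff <- function(x) {
  y <- c()
  for (i in 2:length(x)) {
    y <- c(y, x[i] - x[i - 1])  # c() builds a new, longer vector each time
  }
  y
}

# Hypothetical helper: allocates the result once, then fills it in
prealloc_diff <- function(x) {
  y <- numeric(length(x))
  for (i in 2:length(x)) {
    y[i] <- x[i] - x[i - 1]
  }
  y
}

set.seed(1)
x <- runif(1e4)

t_grow <- system.time(y_grow <- grow_diff(x))
t_pre  <- system.time(y_pre <- prealloc_diff(x))

# Same differences; the preallocated version has an extra leading 0
print(all.equal(y_grow, y_pre[-1]))

# Elapsed seconds for each approach (values vary from run to run)
print(c(grow = t_grow[["elapsed"]], prealloc = t_pre[["elapsed"]]))
```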
And here comes my question. When a vector is updated, it actually does move: a copy is made, as shown below.
a = 1:10
tracemem(a)
[1] "<0xf34a268>"
a[1] <- 0L
tracemem[0xf34a268 -> 0x4ab0c3f8]:
a[3] <-0L
tracemem[0x4ab0c3f8 -> 0xf2b0a48]:
But in a loop this copy does not occur:
y = numeric(length(x))
for(i in 2:length(x))
{
y[i] = x[i] - x[i-1]
print(address(y))
}
This gives:
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
I understand why code is slow or fast depending on memory allocations, but I don't understand R's logic. Why and how, for the same statement, is the update made by reference in one case and by copy in the other? In the general case, how can we know what will happen?
Answer 1:
This is covered in Hadley's Advanced R book. In it he says (paraphrasing here) that whenever 2 or more variables point to the same object and you modify it through one of them, R will make a copy and then modify that copy. Before going into examples, one important note, which is also mentioned in Hadley's book, is that when you're using RStudio, the environment browser makes a reference to every object you create on the command line. Given your observed behavior, I'm assuming you're using RStudio, which, as we will see, explains why there are actually 2 variables pointing to a instead of 1 like you might expect.
The function we'll use to check how many variables are pointing to an object is refs(). In the first example you posted you can see:
library(pryr)
a = 1:10
refs(a)
#[1] 2
This suggests (which is what you found) that 2 variables are pointing to a, and thus any modification to a will result in R copying it, then modifying that copy.
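The same rule can be seen without RStudio's environment browser by creating the second reference explicitly. This is a minimal sketch using only base R. `tracemem` needs an R build with memory profiling enabled, so the calls are guarded with `capabilities("profmem")`, and any addresses it prints will differ from run to run.

```r
a <- c(1, 2, 3)
b <- a  # b and a now refer to the same vector

# Trace duplications of the shared vector, if this R build supports it
if (capabilities("profmem")) tracemem(a)

b[1] <- 0  # the vector is shared, so R copies it before modifying b

print(a)  # a is unchanged
print(b)  # only b sees the new value

if (capabilities("profmem")) untracemem(a)
```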
Checking the for loop, we can see that y always has the same address and that refs(y) = 1 in the for loop. y is not copied because there are no other references pointing to y in your statement y[i] = x[i] - x[i-1]:
for(i in 2:length(x))
{
y[i] = x[i] - x[i-1]
print(c(address(y), refs(y)))
}
#[1] "0x19c3a230" "1"
#[1] "0x19c3a230" "1"
#[1] "0x19c3a230" "1"
#[1] "0x19c3a230" "1"
#[1] "0x19c3a230" "1"
#[1] "0x19c3a230" "1"
#[1] "0x19c3a230" "1"
#[1] "0x19c3a230" "1"
#[1] "0x19c3a230" "1"
On the other hand, if you introduce a non-primitive function of y in your for loop, you will see that the address of y changes each time, which is more in line with what we would expect:
is.primitive(lag)
#[1] FALSE
for(i in 2:length(x))
{
y[i] = lag(y)[i]
print(c(address(y), refs(y)))
}
#[1] "0x19b31600" "1"
#[1] "0x19b31948" "1"
#[1] "0x19b2f4a8" "1"
#[1] "0x19b2d2f8" "1"
#[1] "0x19b299d0" "1"
#[1] "0x19b1bf58" "1"
#[1] "0x19ae2370" "1"
#[1] "0x19a649e8" "1"
#[1] "0x198cccf0" "1"
Note the emphasis on non-primitive. If your function of y is primitive, such as -, as in y[i] = y[i] - y[i-1], R can optimize this to avoid copying.
Credit to @duckmayr for helping explain the behavior inside the for loop.
Answer 2:
To complement @MikeH.'s answer, here is some code:
library(pryr)
x = runif(10)
y = numeric(length(x))
print(c(address(y), refs(y)))
for(i in 2:length(x))
{
y[i] = x[i] - x[i-1]
print(c(address(y), refs(y)))
}
print(c(address(y), refs(y)))
The output clearly shows what happened:
[1] "0x7872180" "2"
[1] "0x765b860" "1"
[1] "0x765b860" "1"
[1] "0x765b860" "1"
[1] "0x765b860" "1"
[1] "0x765b860" "1"
[1] "0x765b860" "1"
[1] "0x765b860" "1"
[1] "0x765b860" "1"
[1] "0x765b860" "1"
[1] "0x765b860" "2"
There is a copy at the first iteration: because of RStudio, there are 2 refs. But after this first copy, y belongs to the loop and is not available in the global environment. RStudio then does not create any additional refs, and thus no copy is made during the subsequent updates: y is updated by reference. On loop exit, y becomes available in the global environment again. RStudio creates an extra ref, but this obviously does not change the address.
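One way to check this away from RStudio's influence is to run the loop inside a function, where y is a local variable, and trace it there. This is only a sketch under assumptions: `diff_traced` is a made-up name, and `tracemem` is called only when the R build supports memory profiling. If no copy is made during the loop, no `tracemem[...]` line should be printed, though this may vary with the R version and the front-end in use.

```r
# Hypothetical helper: same loop as above, but y is local to the function
diff_traced <- function(x) {
  y <- numeric(length(x))
  traced <- capabilities("profmem")
  if (traced) tracemem(y)  # report any duplication of y from here on
  for (i in 2:length(x)) {
    y[i] <- x[i] - x[i - 1]
  }
  if (traced) untracemem(y)
  y
}

x <- runif(10)
y <- diff_traced(x)
print(y)
```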
Source: https://stackoverflow.com/questions/48230311/copy-on-modify-semantic-on-a-vector-does-not-append-in-a-loop-why